Table of contents¶

  • Description
    • Source Of Data
    • Goal Of Data
    • Introduction
    • Objective
  • Import Libraries
  • Load dataset
  • Understand and Manipulate Data

    • Understand
    • Delete Unneeded Data
      • Drop Duplicates
      • Drop Irrelevant Columns
    • Data Integration
      • ZipCode
    • Fix Wrongly Formatted Columns

    • Data Analysis

      • Outliers
      • Descriptive Statistics Analysis
      • Categorise Price
      • Price Per Sqft
        • Price Per Sqft Each City
      • House Age
        • Houses built
        • Most Modified Homes
        • House Age vs Average price
        • Yearly Monthly Transactions
        • Basement impact
        • House Features Vs Price
          • Grade
          • Bedrooms
          • Bathrooms
          • Floors
          • Condition of house
            • Average Price Vs All Features
  • Machine Learning
    • Feature Engineering
    • Regression Models
      • Logistic Regression Model
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Linear Regression Model
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Gradient Boosting Regressor
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Neural Network Regressor
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Random Forest Regressor
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Extra Trees Regressor
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Compare Between Models
    • Classification Models
      • Logistic Regression CLF Model
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Decision Tree Classifier
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Random Forest Classifier
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Gradient Boosting Classifier
        • Model Selection
        • Model Evaluation
        • Model Graph
      • AdaBoost Classifier
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Support Vector Classifier
        • Model Selection
        • Model Evaluation
        • Model Graph
      • Compare Between Models
  • Deployment
    • Deployment Regression
    • Deployment Classification
  • Conclusion
  • Future Work

Description¶

Source¶

The source from which I obtained this data is: https://www.kaggle.com/datasets/shivachandel/kc-house-data

Goal¶

The aim is to analyze the housing data set, understand the trends and patterns in the housing market, and build a predictive model that can accurately estimate the price of a house from its features. This analysis will help potential buyers and sellers make informed decisions and aid real estate agents in providing better recommendations to their clients. The goal is to identify the key factors that influence house prices, such as location, size, condition, and other features, and to develop a model that can predict house prices with a high degree of accuracy. Additionally, the data will be explored to understand geographical trends, identify any outliers or anomalies, and evaluate any potential bias in the data set. Finally, the insights gained from the analysis will be presented clearly and concisely using visualizations and other descriptive statistics.

Introduction¶

This data set contains information on residential properties sold between May 2014 and May 2015 in King County, Washington, USA. The data includes details such as the price, number of bedrooms and bathrooms, square footage of living space and lot size, number of floors, whether the property has a waterfront view, and other features, such as the year built and year renovated. The data set contains 21,613 observations and 21 attributes. Its purpose is to provide insights into the factors that affect the price of a house and to support building a predictive model that can accurately estimate the price of a house based on its features. It can be used by real estate agents, buyers, and sellers to make informed decisions and to gain a better understanding of the housing market in King County, and by researchers and data scientists to explore and analyze trends and patterns in the housing market. The data set is publicly available and can be downloaded from various online sources.

Objective¶

Objectives:

  1. To analyze the factors that influence the price of residential properties in King County, Washington state.
  2. To build a predictive model that can accurately estimate the price of a house based on its features.
  3. To identify the most important features that affect the price of a house.
  4. To explore trends and patterns in the housing market in King County.
  5. To provide insights and recommendations to real estate agents, buyers, and sellers to make informed decisions.
  6. To compare the performance of different machine learning algorithms in predicting house prices.
  7. To evaluate the effectiveness of feature engineering techniques in improving the accuracy of the predictive model.
  8. To identify outliers and anomalies in the data set and determine their impact on the predictive model.
  9. To evaluate the robustness of the predictive model using cross-validation techniques.
  10. To provide a comprehensive analysis and interpretation of the data set for researchers and data scientists.

Libraries¶

In [1]:
#Import necessary libraries for basic data processing
import math
import numpy as np
import pandas as pd
import country_converter as coco
import time
from uszipcode import SearchEngine
#Import libraries for visualization
import seaborn as sns
import matplotlib.pyplot as plt

#Import libraries for modeling
#Convert categorical data into numerical data
from sklearn.preprocessing import LabelEncoder

#Split data into training and testing sets
from sklearn.model_selection import train_test_split

#Scale data
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import MinMaxScaler
#Import regression models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

#Import regression and classification metrics
from sklearn.metrics import (classification_report, r2_score, mean_squared_error,
                             accuracy_score, confusion_matrix, roc_curve,
                             roc_auc_score, auc, precision_recall_curve)
#Import classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Import scipy
from scipy import stats
# Import Deploying 
import joblib
import pickle
# Warning
import warnings
warnings.filterwarnings("ignore")

Dataset¶

In [2]:
#load train dataset
data= pd.read_csv('kc_house_data.csv')
#show data
data
Out[2]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180.0 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170.0 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770.0 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050.0 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680.0 0 1987 0 98074 47.6168 -122.045 1800 7503
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 263000018 20140521T000000 360000.0 3 2.50 1530 1131 3.0 0 0 ... 8 1530.0 0 2009 0 98103 47.6993 -122.346 1530 1509
21609 6600060120 20150223T000000 400000.0 4 2.50 2310 5813 2.0 0 0 ... 8 2310.0 0 2014 0 98146 47.5107 -122.362 1830 7200
21610 1523300141 20140623T000000 402101.0 2 0.75 1020 1350 2.0 0 0 ... 7 1020.0 0 2009 0 98144 47.5944 -122.299 1020 2007
21611 291310100 20150116T000000 400000.0 3 2.50 1600 2388 2.0 0 0 ... 8 1600.0 0 2004 0 98027 47.5345 -122.069 1410 1287
21612 1523300157 20141015T000000 325000.0 2 0.75 1020 1076 2.0 0 0 ... 7 1020.0 0 2008 0 98144 47.5941 -122.299 1020 1357

21613 rows × 21 columns

Read the data and show it.

The columns of the dataset are, according to the data source, described as follows:¶

   "id" : A unique identifier for each record in the dataset.
   "date": The date on which the property was sold.
   'price': The price of the property in USD.
   'bedrooms': The number of bedrooms in the property.
   'bathrooms': The number of bathrooms in the property.
   'sqft_living': The size of the property's living space in square feet.
   'sqft_lot': The size of the property's lot in square feet.
   'floors': The number of floors in the property.
   'waterfront': A binary variable indicating whether the property is located on a waterfront or not.
   'view': A rating of the property's view from 0 to 4.
   'condition': A rating of the property's condition from 1 to 5.
   'grade': A rating of the property's overall grade from 1 to 13.
   'sqft_above': The size of the property's living space above ground level in square feet.
   'sqft_basement': The size of the property's living space below ground level in square feet.
   'yr_built': The year in which the property was built.
   'yr_renovated': The year in which the property was last renovated.
   'zipcode': The zipcode of the area in which the property is located.
   'lat': The latitude coordinate of the property's location.
   'long': The longitude coordinate of the property's location.
   'sqft_living15': The average size of nearby houses' living space in square feet.
   'sqft_lot15': The average size of nearby houses' lots in square feet.
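One detail worth noting before the analysis: the date column stores timestamps in a compact form such as 20141013T000000. A minimal sketch of parsing that format with pandas, using a small stand-in frame rather than the full dataset (the sale_year and sale_month names are illustrative, not part of the original data):

```python
import pandas as pd

# Small stand-in frame mimicking the 'date' column of kc_house_data
df = pd.DataFrame({"date": ["20141013T000000", "20150225T000000"]})

# The values follow a compact YYYYMMDDThhmmss pattern
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")

# Derived year/month columns are handy for time-based grouping later
df["sale_year"] = df["date"].dt.year
df["sale_month"] = df["date"].dt.month
print(df[["sale_year", "sale_month"]])
```

Parsed dates make the yearly/monthly transaction analysis later in the notebook a simple groupby on the derived year and month columns.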

Understand¶

First we run a quick analysis on the dataset itself to assess its overall quality and get a sense of the first steps needed to process and interpret the data.

In [3]:
# Print the number of rows and columns in the dataset
print(f'The dataset has {data.shape[0]} rows and {data.shape[1]} columns\n')

# Print a separator line
print('- -' * 30)

# Print value counts for each column in the dataset
print('Value counts for each column: \n')
for i in data.columns:
    # Print the name of the column
    print(f'===== {i} =====\n')
    
    # Print the value counts for each unique value in the column, sorted in descending order
    print(data[i].value_counts().sort_values(ascending=False))
    
    # Print a separator line between each column's value counts
    print('--' * 30)
The dataset has 21613 rows and 21 columns

- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
Value counts for each column: 

===== id =====

795000620     3
5282200015    2
8832900780    2
526059224     2
5101405604    2
             ..
993001976     1
525049174     1
4187000190    1
6056110780    1
1523300157    1
Name: id, Length: 21436, dtype: int64
------------------------------------------------------------
===== date =====

20140623T000000    142
20140625T000000    131
20140626T000000    131
20140708T000000    127
20150427T000000    126
                  ... 
20141130T000000      1
20140803T000000      1
20150527T000000      1
20150110T000000      1
20140727T000000      1
Name: date, Length: 372, dtype: int64
------------------------------------------------------------
===== price =====

350000.0    172
450000.0    172
550000.0    159
500000.0    152
425000.0    150
           ... 
514700.0      1
388598.0      1
471275.0      1
521500.0      1
402101.0      1
Name: price, Length: 4028, dtype: int64
------------------------------------------------------------
===== bedrooms =====

3     9824
4     6882
2     2760
5     1601
6      272
1      199
7       38
0       13
8       13
9        6
10       3
11       1
33       1
Name: bedrooms, dtype: int64
------------------------------------------------------------
===== bathrooms =====

2.50    5380
1.00    3852
1.75    3048
2.25    2047
2.00    1930
1.50    1446
2.75    1185
3.00     753
3.50     731
3.25     589
3.75     155
4.00     136
4.50     100
4.25      79
0.75      72
4.75      23
5.00      21
5.25      13
0.00      10
5.50      10
1.25       9
6.00       6
0.50       4
5.75       4
6.75       2
8.00       2
6.25       2
6.50       2
7.50       1
7.75       1
Name: bathrooms, dtype: int64
------------------------------------------------------------
===== sqft_living =====

1300    138
1400    135
1440    133
1660    129
1010    129
       ... 
2478      1
1496      1
3402      1
1061      1
1425      1
Name: sqft_living, Length: 1038, dtype: int64
------------------------------------------------------------
===== sqft_lot =====

5000    358
6000    290
4000    251
7200    220
4800    120
       ... 
914       1
4396      1
1449      1
1902      1
1076      1
Name: sqft_lot, Length: 9782, dtype: int64
------------------------------------------------------------
===== floors =====

1.0    10680
2.0     8241
1.5     1910
3.0      613
2.5      161
3.5        8
Name: floors, dtype: int64
------------------------------------------------------------
===== waterfront =====

0    21450
1      163
Name: waterfront, dtype: int64
------------------------------------------------------------
===== view =====

0    19489
2      963
3      510
1      332
4      319
Name: view, dtype: int64
------------------------------------------------------------
===== condition =====

3    14031
4     5679
5     1701
2      172
1       30
Name: condition, dtype: int64
------------------------------------------------------------
===== grade =====

7     8981
8     6068
9     2615
6     2038
10    1134
11     399
5      242
12      90
4       29
13      13
3        3
1        1
Name: grade, dtype: int64
------------------------------------------------------------
===== sqft_above =====

1300.0    212
1010.0    210
1200.0    206
1220.0    192
1140.0    184
         ... 
2864.0      1
2716.0      1
1572.0      1
3281.0      1
1425.0      1
Name: sqft_above, Length: 946, dtype: int64
------------------------------------------------------------
===== sqft_basement =====

0       13126
600       221
700       218
500       214
800       206
        ...  
2180        1
225         1
276         1
1248        1
248         1
Name: sqft_basement, Length: 306, dtype: int64
------------------------------------------------------------
===== yr_built =====

2014    559
2006    454
2005    450
2004    433
2003    422
       ... 
1933     30
1901     29
1902     27
1935     24
1934     21
Name: yr_built, Length: 116, dtype: int64
------------------------------------------------------------
===== yr_renovated =====

0       20699
2014       91
2013       37
2003       36
2005       35
        ...  
1951        1
1959        1
1948        1
1954        1
1944        1
Name: yr_renovated, Length: 70, dtype: int64
------------------------------------------------------------
===== zipcode =====

98103    602
98038    590
98115    583
98052    574
98117    553
        ... 
98102    105
98010    100
98024     81
98148     57
98039     50
Name: zipcode, Length: 70, dtype: int64
------------------------------------------------------------
===== lat =====

47.6624    17
47.6846    17
47.5491    17
47.5322    17
47.6955    16
           ..
47.2920     1
47.3698     1
47.2839     1
47.2995     1
47.6502     1
Name: lat, Length: 5034, dtype: int64
------------------------------------------------------------
===== long =====

-122.290    116
-122.300    111
-122.362    104
-122.291    100
-122.363     99
           ... 
-122.447      1
-121.797      1
-122.491      1
-121.837      1
-121.403      1
Name: long, Length: 752, dtype: int64
------------------------------------------------------------
===== sqft_living15 =====

1540    197
1440    195
1560    192
1500    181
1460    169
       ... 
2238      1
2616      1
1427      1
2456      1
2927      1
Name: sqft_living15, Length: 777, dtype: int64
------------------------------------------------------------
===== sqft_lot15 =====

5000     427
4000     357
6000     289
7200     211
4800     145
        ... 
6801       1
9937       1
26027      1
4795       1
2007       1
Name: sqft_lot15, Length: 8689, dtype: int64
------------------------------------------------------------

This code performs an exploratory data analysis on a dataset.

  1. The first line of code prints the number of rows and columns in the dataset using the shape attribute of the Pandas DataFrame.

  2. The second line of code prints a separator line to visually separate the output.

  3. The third line of code initiates a loop that iterates over each column in the dataset.

  4. The fourth line of code prints the name of the current column being analyzed.

  5. The fifth line of code prints the value counts for each unique value in the column, sorted in descending order using the value_counts() method of the Pandas Series.

  6. The sixth line of code prints a separator line between each column's value counts.

Overall, this code provides a quick overview of the dataset by printing the number of rows and columns, and the value counts of each column in the dataset, which helps to identify the most common values in each attribute.
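When the full value_counts listing is more detail than needed, pandas can summarize how many distinct values each column holds in a single call; a small sketch on a stand-in frame (not the real housing data):

```python
import pandas as pd

# Stand-in frame with the same flavor as the housing data
data = pd.DataFrame({
    "bedrooms":   [3, 3, 2, 4],
    "waterfront": [0, 0, 1, 0],
})

# nunique() reports the number of distinct values per column,
# a quick way to spot near-constant or high-cardinality columns
unique_counts = data.nunique().sort_values(ascending=False)
print(unique_counts)
```

On the real dataset this immediately flags columns like waterfront (2 distinct values) versus near-unique columns like id.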

In [4]:
# Show the column names, dtypes, and the length of the dataset
data.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21611 non-null  float64
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(6), int64(14), object(1)
memory usage: 3.5+ MB


This code prints a summary of the dataset, including the column names, data types, and number of non-null values in each column.

In [5]:
# Summary statistics for the numerical variables of the dataset
data.describe()
Out[5]:
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21611.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 3.409430 7.656873 1788.396095 291.509045 1971.005136 84.402258 98077.939805 47.560053 -122.213896 1986.552492 12768.455652
std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897 4.142051e+04 0.539989 0.086517 0.766318 0.650743 1.175459 828.128162 442.575043 29.373411 401.679240 53.505026 0.138564 0.140828 685.391304 27304.179631
min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 1.000000 290.000000 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 0.000000 1951.000000 0.000000 98033.000000 47.471000 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 1975.000000 0.000000 98065.000000 47.571800 -122.230000 1840.000000 7620.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 560.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

Describe the numerical data.

In [6]:
# Summary statistics cast to integers for readability
data.describe().astype(int) 
Out[6]:
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 21613 21613 21613 21613 21613 21613 21613 21613 21613 21613 21613 21611 21613 21613 21613 21613 21613 21613 21613 21613
mean -2147483648 540088 3 2 2079 15106 1 0 0 3 7 1788 291 1971 84 98077 47 -122 1986 12768
std -2147483648 367127 0 0 918 41420 0 0 0 0 1 828 442 29 401 53 0 0 685 27304
min 1000102 75000 0 0 290 520 1 0 0 1 1 290 0 1900 0 98001 47 -122 399 651
25% 2123049194 321950 3 1 1427 5040 1 0 0 3 7 1190 0 1951 0 98033 47 -122 1490 5100
50% -2147483648 450000 3 2 1910 7618 1 0 0 3 7 1560 0 1975 0 98065 47 -122 1840 7620
75% -2147483648 645000 4 2 2550 10688 2 0 0 4 8 2210 560 1997 0 98118 47 -122 2360 10083
max -2147483648 7700000 33 8 13540 1651359 3 1 4 5 13 9410 4820 2015 2015 98199 47 -121 6210 871200

Describe the numerical data cast to integers. Note the -2147483648 values in the id row: they are overflow from casting id-sized numbers to a 32-bit integer.
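Those -2147483648 entries appear because astype(int) resolves to a 32-bit integer on some platforms (notably NumPy on Windows), which cannot hold values as large as id. A minimal sketch of a safe alternative, using illustrative values rather than the real frame:

```python
import pandas as pd

# Illustrative values: roughly the mean of 'id' and of 'price'
desc = pd.Series([4580301520.86499, 540088.14177])

# round() first, then an explicit 64-bit cast; unlike a bare astype(int),
# "int64" is 64 bits on every platform and holds id-sized values intact
safe = desc.round().astype("int64")
print(safe)
```

In the notebook, data.describe().round().astype('int64') should give the same readable table without the overflowed id row.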

In [7]:
# Summary statistics for all variables of the dataset
data.describe(include='all') 
Out[7]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.161300e+04 21613 2.161300e+04 21613.000000 21613.000000 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 ... 21613.000000 21611.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
unique NaN 372 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN 20140623T000000 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN 142 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 4.580302e+09 NaN 5.400881e+05 3.370842 2.114757 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 ... 7.656873 1788.396095 291.509045 1971.005136 84.402258 98077.939805 47.560053 -122.213896 1986.552492 12768.455652
std 2.876566e+09 NaN 3.671272e+05 0.930062 0.770163 918.440897 4.142051e+04 0.539989 0.086517 0.766318 ... 1.175459 828.128162 442.575043 29.373411 401.679240 53.505026 0.138564 0.140828 685.391304 27304.179631
min 1.000102e+06 NaN 7.500000e+04 0.000000 0.000000 290.000000 5.200000e+02 1.000000 0.000000 0.000000 ... 1.000000 290.000000 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 NaN 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 ... 7.000000 1190.000000 0.000000 1951.000000 0.000000 98033.000000 47.471000 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 NaN 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 ... 7.000000 1560.000000 0.000000 1975.000000 0.000000 98065.000000 47.571800 -122.230000 1840.000000 7620.000000
75% 7.308900e+09 NaN 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 ... 8.000000 2210.000000 560.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 NaN 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 ... 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

11 rows × 21 columns

Describe both the numerical and categorical columns using include='all'.

In [8]:
# Count the missing values in each column
data.isnull().sum() 
Out[8]:
id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       2
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

Check for missing data.

In [9]:
# Check for rows where price equals 0
data.query("price == 0")
Out[9]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15

0 rows × 21 columns

Check whether the dataset contains any rows with price = 0; none are found.

In [10]:
#Change format to standarize the dataset describe() output
pd.set_option('display.float_format', lambda x: '%.5f' % x) # Set 5 decimals to eliminate numerical notation
data.describe()
Out[10]:
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21611.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000 21613.00000
mean 4580301520.86499 540088.14177 3.37084 2.11476 2079.89974 15106.96757 1.49431 0.00754 0.23430 3.40943 7.65687 1788.39609 291.50905 1971.00514 84.40226 98077.93980 47.56005 -122.21390 1986.55249 12768.45565
std 2876565571.31205 367127.19648 0.93006 0.77016 918.44090 41420.51152 0.53999 0.08652 0.76632 0.65074 1.17546 828.12816 442.57504 29.37341 401.67924 53.50503 0.13856 0.14083 685.39130 27304.17963
min 1000102.00000 75000.00000 0.00000 0.00000 290.00000 520.00000 1.00000 0.00000 0.00000 1.00000 1.00000 290.00000 0.00000 1900.00000 0.00000 98001.00000 47.15590 -122.51900 399.00000 651.00000
25% 2123049194.00000 321950.00000 3.00000 1.75000 1427.00000 5040.00000 1.00000 0.00000 0.00000 3.00000 7.00000 1190.00000 0.00000 1951.00000 0.00000 98033.00000 47.47100 -122.32800 1490.00000 5100.00000
50% 3904930410.00000 450000.00000 3.00000 2.25000 1910.00000 7618.00000 1.50000 0.00000 0.00000 3.00000 7.00000 1560.00000 0.00000 1975.00000 0.00000 98065.00000 47.57180 -122.23000 1840.00000 7620.00000
75% 7308900445.00000 645000.00000 4.00000 2.50000 2550.00000 10688.00000 2.00000 0.00000 0.00000 4.00000 8.00000 2210.00000 560.00000 1997.00000 0.00000 98118.00000 47.67800 -122.12500 2360.00000 10083.00000
max 9900000190.00000 7700000.00000 33.00000 8.00000 13540.00000 1651359.00000 3.50000 1.00000 4.00000 5.00000 13.00000 9410.00000 4820.00000 2015.00000 2015.00000 98199.00000 47.77760 -121.31500 6210.00000 871200.00000

Print the description of the data with values formatted to 5 decimal places.

Delete Unneeded Data¶

Check Missing Values¶

In [11]:
# Count the null values in each column
data.isnull().sum()
Out[11]:
id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       2
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
In [12]:
# Drop rows containing NaN values
data.dropna(inplace=True)

Drop rows with null values.

In [13]:
# Check missing values after dropping them
data.isnull().sum()
Out[13]:
id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

Drop Duplicates¶

In [14]:
# Print shape before dropping duplicates
print('Shape of data before removing duplicates', data.shape)
# Drop duplicates and print shape after
data.drop_duplicates(inplace=True)
print('Shape of data after removing duplicates', data.shape)
Shape of data before removing duplicates (21611, 21)
Shape of data after removing duplicates (21611, 21)

data.drop_duplicates(inplace=True) drops duplicate rows from the DataFrame in place. The print calls report data.shape, a tuple of (rows, columns), before and after the drop; here the shape is unchanged, so the dataset contains no fully duplicated rows.

Drop Irrelevant Columns¶

In [15]:
(data[data['yr_renovated']==0].shape)[0]
Out[15]:
20697

This counts the rows in the data DataFrame where yr_renovated equals 0: 20,697 of the 21,611 homes were never renovated, so the column carries little information.

In [16]:
(data[data['sqft_living']==data['sqft_living15']].shape)[0]
Out[16]:
2566

  • data[data['sqft_living']==data['sqft_living15']]: This selects all rows in the data DataFrame where the sqft_living column is equal to the sqft_living15 column.
  • .shape: This returns a tuple representing the dimensions of the resulting DataFrame, where the first element is the number of rows and the second element is the number of columns.
  • [0]: This selects the first element of the tuple, which corresponds to the number of rows.

In [17]:
(data[data['sqft_lot']==data['sqft_lot15']].shape)[0]
Out[17]:
4474

This line of Python code calculates the number of rows in a DataFrame data where the value in the sqft_lot column is equal to the value in the sqft_lot15 column. Here's a comment explaining each part of the code:

(data[data['sqft_lot']==data['sqft_lot15']].shape)[0]
  • data[data['sqft_lot']==data['sqft_lot15']]: This selects all rows in the data DataFrame where the sqft_lot column is equal to the sqft_lot15 column.
  • .shape: This returns a tuple representing the dimensions of the resulting DataFrame, where the first element is the number of rows and the second element is the number of columns.
  • [0]: This selects the first element of the tuple, which corresponds to the number of rows.

Therefore, the overall line of code returns the number of rows in data where sqft_lot is equal to sqft_lot15.
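The same count can be written more directly by summing a boolean mask, which avoids building the intermediate filtered DataFrame; a small sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame with the two lot-size columns
data = pd.DataFrame({
    "sqft_lot":   [5000, 6000, 7200, 4800],
    "sqft_lot15": [5000, 6500, 7200, 5100],
})

# True counts as 1, so summing the mask counts the matching rows;
# equivalent to data[data['sqft_lot'] == data['sqft_lot15']].shape[0]
n_equal = int((data["sqft_lot"] == data["sqft_lot15"]).sum())
print(n_equal)  # 2
```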

In [18]:
# Drop irrelevant columns
data.drop(['yr_renovated','id'], axis=1, inplace=True)
#print shape
data.shape
Out[18]:
(21611, 19)

  • data.drop(['yr_renovated', 'id'], axis=1, inplace=True): This drops the yr_renovated and id columns from the data DataFrame and modifies it in place.
  • ['yr_renovated', 'id']: This is the list of column names to drop.
  • axis=1: This indicates that the drop happens along the column axis.
  • inplace=True: This modifies the DataFrame in place rather than returning a new one.
  • data.shape: This returns a tuple representing the dimensions of the DataFrame after the columns have been dropped, where the first element is the number of rows and the second element is the number of columns.

Data Integration¶

ZipCode¶

The data were collected in the US, so we will use the uszipcode library.

What can we get from it? City, state, county, population, population density, and housing units.

In [19]:
# Create an instance of the SearchEngine class
engine = SearchEngine()

# Record the start time
start = time.time()

# Define a function to get the location information for a given zipcode and add it to a DataFrame
def get_location(zipcode, data):
    # Use the SearchEngine instance to get the location information for the given zipcode
    location = engine.by_zipcode(zipcode)
    
    # Add the location information to the DataFrame
    data["city"] = location.major_city
    data["state"] = location.state
    data["county"] = location.county
    data["population"] = location.population
    data["population_density"] = location.population_density

    # Return the updated DataFrame
    return data

# Apply the get_location function to each row of the DataFrame using the apply method
data = data.apply(lambda x: get_location(x['zipcode'], x), axis=1)

# Record the end time
end = time.time()

# Print the execution time and the updated DataFrame
print(f"The time of execution of above program is :{end-start}\n")
data
The time of execution of above program is :230.04953861236572

Out[19]:
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition ... zipcode lat long sqft_living15 sqft_lot15 city state county population population_density
0 20141013T000000 221900.00000 3 1.00000 1180 5650 1.00000 0 0 3 ... 98178 47.51120 -122.25700 1340 5650 Seattle WA King County 24092 4966.00000
1 20141209T000000 538000.00000 3 2.25000 2570 7242 2.00000 0 0 3 ... 98125 47.72100 -122.31900 1690 7639 Seattle WA King County 37081 6879.00000
2 20150225T000000 180000.00000 2 1.00000 770 10000 1.00000 0 0 3 ... 98028 47.73790 -122.23300 2720 8062 Kenmore WA King County 20419 3606.00000
3 20141209T000000 604000.00000 4 3.00000 1960 5000 1.00000 0 0 5 ... 98136 47.52080 -122.39300 1360 5000 Seattle WA King County 14770 6425.00000
4 20150218T000000 510000.00000 3 2.00000 1680 8080 1.00000 0 0 3 ... 98074 47.61680 -122.04500 1800 7503 Sammamish WA King County 25748 2411.00000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 20140521T000000 360000.00000 3 2.50000 1530 1131 3.00000 0 0 3 ... 98103 47.69930 -122.34600 1530 1509 Seattle WA King County 45911 9905.00000
21609 20150223T000000 400000.00000 4 2.50000 2310 5813 2.00000 0 0 3 ... 98146 47.51070 -122.36200 1830 7200 Seattle WA King County 25922 5573.00000
21610 20140623T000000 402101.00000 2 0.75000 1020 1350 2.00000 0 0 3 ... 98144 47.59440 -122.29900 1020 2007 Seattle WA King County 26881 7895.00000
21611 20150116T000000 400000.00000 3 2.50000 1600 2388 2.00000 0 0 3 ... 98027 47.53450 -122.06900 1410 1287 Issaquah WA King County 26141 469.00000
21612 20141015T000000 325000.00000 2 0.75000 1020 1076 2.00000 0 0 3 ... 98144 47.59410 -122.29900 1020 1357 Seattle WA King County 26881 7895.00000

21611 rows × 24 columns

  • engine = SearchEngine(): This creates an instance of the SearchEngine class from the uszipcode library.
  • start = time.time(): This records the current time in seconds since the Epoch.
  • def get_location(zipcode, data):: This defines a function that takes a zipcode and a row of the DataFrame and returns the row with location columns added.
  • location = engine.by_zipcode(zipcode): This uses the SearchEngine instance to look up the location information for the given zipcode.
  • data["city"], data["state"], data["county"], data["population"], data["population_density"]: These new columns are filled with the major city, state, county, population, and population density from the lookup result.
  • data = data.apply(lambda x: get_location(x['zipcode'], x), axis=1): This applies get_location to every row, performing one zipcode lookup per row, which is why the cell takes roughly 230 seconds.
  • end = time.time() and the print call report the execution time, and the final data expression displays the DataFrame with the five new columns.
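
The row-wise apply above calls engine.by_zipcode once per row (about 21,000 calls), although the dataset contains far fewer distinct zipcodes. Caching one lookup per unique zipcode and mapping the results back is much faster. A minimal sketch of the pattern, with a stub lookup standing in for engine.by_zipcode (the real call would return a uszipcode result object):

```python
import pandas as pd

# Stub lookup standing in for engine.by_zipcode; with uszipcode you would
# fill the cache by calling engine.by_zipcode(z) once per unique zipcode.
def lookup(zipcode):
    stub = {98178: {"city": "Seattle", "state": "WA"},
            98028: {"city": "Kenmore", "state": "WA"}}
    return stub[zipcode]

# Toy frame with a repeated zipcode to show the caching benefit
data = pd.DataFrame({"zipcode": [98178, 98028, 98178]})

# One lookup per unique zipcode, then a cheap per-row map
cache = {z: lookup(z) for z in data["zipcode"].unique()}
for col in ["city", "state"]:
    data[col] = data["zipcode"].map(lambda z: cache[z][col])

print(list(data["city"]))  # ['Seattle', 'Kenmore', 'Seattle']
```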

Manipulate Columns With Wrong Formatting¶

In [20]:
# Convert the date column to a datetime format
data['date'] = pd.to_datetime(data['date'])

# Add a new column for the year of the transaction
data["tr_year"] = data["date"].dt.year

# Add a new column for the month of the transaction
data["tr_month"] = data["date"].dt.month

# Change the date column to a string format with only year and month
data["date"] = data["date"].dt.strftime('%Y-%m')

# Print the updated DataFrame
data
Out[20]:
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition ... long sqft_living15 sqft_lot15 city state county population population_density tr_year tr_month
0 2014-10 221900.00000 3 1.00000 1180 5650 1.00000 0 0 3 ... -122.25700 1340 5650 Seattle WA King County 24092 4966.00000 2014 10
1 2014-12 538000.00000 3 2.25000 2570 7242 2.00000 0 0 3 ... -122.31900 1690 7639 Seattle WA King County 37081 6879.00000 2014 12
2 2015-02 180000.00000 2 1.00000 770 10000 1.00000 0 0 3 ... -122.23300 2720 8062 Kenmore WA King County 20419 3606.00000 2015 2
3 2014-12 604000.00000 4 3.00000 1960 5000 1.00000 0 0 5 ... -122.39300 1360 5000 Seattle WA King County 14770 6425.00000 2014 12
4 2015-02 510000.00000 3 2.00000 1680 8080 1.00000 0 0 3 ... -122.04500 1800 7503 Sammamish WA King County 25748 2411.00000 2015 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 2014-05 360000.00000 3 2.50000 1530 1131 3.00000 0 0 3 ... -122.34600 1530 1509 Seattle WA King County 45911 9905.00000 2014 5
21609 2015-02 400000.00000 4 2.50000 2310 5813 2.00000 0 0 3 ... -122.36200 1830 7200 Seattle WA King County 25922 5573.00000 2015 2
21610 2014-06 402101.00000 2 0.75000 1020 1350 2.00000 0 0 3 ... -122.29900 1020 2007 Seattle WA King County 26881 7895.00000 2014 6
21611 2015-01 400000.00000 3 2.50000 1600 2388 2.00000 0 0 3 ... -122.06900 1410 1287 Issaquah WA King County 26141 469.00000 2015 1
21612 2014-10 325000.00000 2 0.75000 1020 1076 2.00000 0 0 3 ... -122.29900 1020 1357 Seattle WA King County 26881 7895.00000 2014 10

21611 rows × 26 columns

  • data['date'] = pd.to_datetime(data['date']): This converts the date column of the DataFrame data to a datetime format using the pd.to_datetime() function from the Pandas library.
  • data["tr_year"] = data["date"].dt.year: This creates a new column in the DataFrame data called tr_year that contains the year of the transaction, obtained from the date column by using the .dt.year attribute.
  • data["tr_month"] = data["date"].dt.month: This creates a new column in the DataFrame data called tr_month that contains the month of the transaction, obtained from the date column by using the .dt.month attribute.
  • data["date"] = data["date"].dt.strftime('%Y-%m'): This modifies the date column of the DataFrame data to a string format with only year and month, obtained from the date column by using the .dt.strftime() method with the %Y-%m format.
  • data: This prints the updated DataFrame with the new columns and modified date column.

Therefore, the overall code performs transformations on the date column of data to extract the year and month of the transaction and change the date format to only show year and month.
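
The same transformation can be checked on a couple of raw date strings in the dataset's format (YYYYMMDDTHHMMSS):

```python
import pandas as pd

# Two dates in the raw '20141013T000000' format used by the dataset
s = pd.to_datetime(pd.Series(["20141013T000000", "20150225T000000"]),
                   format="%Y%m%dT%H%M%S")

print(list(s.dt.year))               # [2014, 2015]
print(list(s.dt.month))              # [10, 2]
print(list(s.dt.strftime("%Y-%m")))  # ['2014-10', '2015-02']
```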

In [21]:
# Print the max and min transaction year and month
(data[['tr_year','tr_month']].max(),data[['tr_year','tr_month']].min())
Out[21]:
(tr_year     2015
 tr_month      12
 dtype: int64,
 tr_year     2014
 tr_month       1
 dtype: int64)

The overall line returns a tuple containing the column-wise maximum and minimum values of the tr_year and tr_month columns of data: the first element holds the maxima and the second the minima. Note that these are per-column extremes, so the pairs do not necessarily correspond to actual (year, month) combinations in the data.

In [22]:
# Convert the 'price' column to integer data type
data['price'] = data['price'].astype(int)
# Convert the ' population density ' column to integer data type
data['population_density'] = data['population_density'].astype(int)

The overall code converts the price and population_density columns of data to the integer data type, which is convenient when performing arithmetic on these columns.

In [23]:
#print Information of data
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21611 entries, 0 to 21612
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                21611 non-null  object 
 1   price               21611 non-null  int32  
 2   bedrooms            21611 non-null  int64  
 3   bathrooms           21611 non-null  float64
 4   sqft_living         21611 non-null  int64  
 5   sqft_lot            21611 non-null  int64  
 6   floors              21611 non-null  float64
 7   waterfront          21611 non-null  int64  
 8   view                21611 non-null  int64  
 9   condition           21611 non-null  int64  
 10  grade               21611 non-null  int64  
 11  sqft_above          21611 non-null  float64
 12  sqft_basement       21611 non-null  int64  
 13  yr_built            21611 non-null  int64  
 14  zipcode             21611 non-null  int64  
 15  lat                 21611 non-null  float64
 16  long                21611 non-null  float64
 17  sqft_living15       21611 non-null  int64  
 18  sqft_lot15          21611 non-null  int64  
 19  city                21611 non-null  object 
 20  state               21611 non-null  object 
 21  county              21611 non-null  object 
 22  population          21611 non-null  int64  
 23  population_density  21611 non-null  int32  
 24  tr_year             21611 non-null  int64  
 25  tr_month            21611 non-null  int64  
dtypes: float64(5), int32(2), int64(15), object(4)
memory usage: 4.3+ MB

Print the DataFrame info after the updates: 26 columns, all with 21611 non-null values.

In [24]:
# Take copy from data
df_copy = data.copy(deep=True)

Take a deep copy of data so that later transformations do not modify the original DataFrame.

In [25]:
#calculates the percentage of 0 values in a DataFrame `df_copy` relative to the total number of rows in another DataFrame `data`
round((df_copy[df_copy == 0].count()/data.shape[0])*100)
Out[25]:
date                  0.00000
price                 0.00000
bedrooms              0.00000
bathrooms             0.00000
sqft_living           0.00000
sqft_lot              0.00000
floors                0.00000
waterfront           99.00000
view                 90.00000
condition             0.00000
grade                 0.00000
sqft_above            0.00000
sqft_basement        61.00000
yr_built              0.00000
zipcode               0.00000
lat                   0.00000
long                  0.00000
sqft_living15         0.00000
sqft_lot15            0.00000
city                  0.00000
state                 0.00000
county                0.00000
population            0.00000
population_density    0.00000
tr_year               0.00000
tr_month              0.00000
dtype: float64

This code calculates the percentage of zero values in a Pandas DataFrame df_copy relative to the total number of rows in another DataFrame data.

  • The first part of the code df_copy[df_copy == 0].count() masks every non-zero value to NaN (zeros are kept), and .count() then counts the non-NaN values in each column, i.e. the number of zeros per column.

  • The second part of the code data.shape[0] gets the total number of rows in the original DataFrame data.

  • The result of the above calculation is then multiplied by 100 to get the percentage of zero values in df_copy relative to the total number of rows in data.

  • The round() function is used to round the result to the nearest whole number.
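
The masked count is equivalent to summing the boolean comparison directly, which some readers may find clearer. A quick sketch on toy data (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"waterfront": [0, 0, 1, 0], "view": [0, 2, 0, 0]})

# df[df == 0] keeps zeros and turns everything else into NaN,
# so .count() (the non-NaN count) counts the zeros per column
masked = round((df[df == 0].count() / df.shape[0]) * 100)

# Equivalent: sum the boolean mask directly
direct = round(((df == 0).sum() / len(df)) * 100)

print(masked.equals(direct))  # True
```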

In [26]:
# Drop the waterfront column from df_copy (99% of its values are 0)
df_copy.drop('waterfront',axis=1,inplace = True )
#print dataset
df_copy
Out[26]:
date price bedrooms bathrooms sqft_living sqft_lot floors view condition grade ... long sqft_living15 sqft_lot15 city state county population population_density tr_year tr_month
0 2014-10 221900 3 1.00000 1180 5650 1.00000 0 3 7 ... -122.25700 1340 5650 Seattle WA King County 24092 4966 2014 10
1 2014-12 538000 3 2.25000 2570 7242 2.00000 0 3 7 ... -122.31900 1690 7639 Seattle WA King County 37081 6879 2014 12
2 2015-02 180000 2 1.00000 770 10000 1.00000 0 3 6 ... -122.23300 2720 8062 Kenmore WA King County 20419 3606 2015 2
3 2014-12 604000 4 3.00000 1960 5000 1.00000 0 5 7 ... -122.39300 1360 5000 Seattle WA King County 14770 6425 2014 12
4 2015-02 510000 3 2.00000 1680 8080 1.00000 0 3 8 ... -122.04500 1800 7503 Sammamish WA King County 25748 2411 2015 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 2014-05 360000 3 2.50000 1530 1131 3.00000 0 3 8 ... -122.34600 1530 1509 Seattle WA King County 45911 9905 2014 5
21609 2015-02 400000 4 2.50000 2310 5813 2.00000 0 3 8 ... -122.36200 1830 7200 Seattle WA King County 25922 5573 2015 2
21610 2014-06 402101 2 0.75000 1020 1350 2.00000 0 3 7 ... -122.29900 1020 2007 Seattle WA King County 26881 7895 2014 6
21611 2015-01 400000 3 2.50000 1600 2388 2.00000 0 3 8 ... -122.06900 1410 1287 Issaquah WA King County 26141 469 2015 1
21612 2014-10 325000 2 0.75000 1020 1076 2.00000 0 3 7 ... -122.29900 1020 1357 Seattle WA King County 26881 7895 2014 10

21611 rows × 25 columns

This code drops the column named 'waterfront' from the Pandas DataFrame df_copy using the drop() method. The axis=1 parameter specifies that the column should be dropped, and the inplace=True parameter specifies that the changes should be made to df_copy directly.

After dropping the column, the updated df_copy DataFrame is printed to the console.

Outliers¶

In [27]:
# Define a function to handle outliers in a DataFrame
def handling_outliers(df, display=False, drop=False, drop_order=1, columns_to_drop=[]):
    
    # Get a list of numerical columns in the DataFrame
    numerical_columns = list((df.select_dtypes(include=np.number)).columns)

    # If display is True, plot boxplots for each numerical column
    if display:
        x = math.ceil(len(numerical_columns)/3)
        plt.figure(figsize=(15, 25))
        plt.subplots_adjust(hspace=0.5)
        plt.suptitle("Outliers Detection")
        for i in numerical_columns:
            y = numerical_columns.index(i) + 1
            ax = plt.subplot(x, 3, y)
            ax = sns.boxplot(x=df[i], data=df)
            ax.set_title(i)
    
    # If drop is True, remove outliers from the DataFrame
    if drop:

        # If columns_to_drop is not empty, restrict outlier removal to those columns
        if len(columns_to_drop) != 0:
            numerical_columns = columns_to_drop

        # If drop_order is less than 1, set it to 1 so at least one pass runs
        if drop_order < 1:
            drop_order = 1
            
        # Remove outliers drop_order times using the interquartile range (IQR) method
        while drop_order != 0:
            
            for i in numerical_columns:
                q1 = df[i].quantile(0.25)
                q3 = df[i].quantile(0.75)
                iqr = q3 - q1
                lower = q1 - 1.5*iqr
                if lower < 0:
                    lower = 0
                higher = q3 + 1.5*iqr
                df = df[df[i] >= lower] 
                df = df[df[i] <= higher]
            
            drop_order = drop_order - 1
    
    # Return the updated DataFrame
    return df

This code defines a function handling_outliers that takes a DataFrame df and some optional arguments to either display or remove outliers from the DataFrame. If display is True, the function plots boxplots for each numerical column in the DataFrame. If drop is True, the function removes outliers from the DataFrame using the interquartile range (IQR) method. The drop_order argument specifies how many times to remove outliers, and the columns_to_drop argument allows the user to restrict outlier removal to specific columns. Finally, the function returns the updated DataFrame.
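
The IQR rule at the core of the function can be seen on a tiny series with one obvious outlier (hypothetical values):

```python
import pandas as pd

# Tiny series where 100 is clearly an outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the IQR fences
filtered = s[(s >= lower) & (s <= upper)]
print(list(filtered))  # [10, 12, 11, 13, 12, 11]
```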

Boxplots before dropping outliers from the selected columns

In [28]:
# Calling handling_outliers and passing parameters, including the columns to clean
df_1 = handling_outliers(df_copy , display= True , drop=True , drop_order=2 , columns_to_drop =['price','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_basement','sqft_living15','sqft_lot15','grade'])

This code calls the handling_outliers function defined above and passes several parameters to it:

  • df_copy is the Pandas DataFrame passed in for outlier handling.

  • display=True tells the function to plot a boxplot for each numerical column, so the outliers can be inspected visually.

  • drop=True tells the function to remove rows containing outliers using the IQR method.

  • drop_order=2 applies the IQR-based removal twice; a second pass catches values that only become outliers once the most extreme rows are gone.

  • columns_to_drop=['price','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_basement','sqft_living15','sqft_lot15','grade'] restricts outlier removal to the listed columns; the remaining numerical columns are left untouched.

The function returns a new DataFrame df_1 with the outliers removed from the listed columns, and the display=True parameter shows the boxplots.

In [29]:
# Calling handling_outliers again to verify the outliers were removed
df_1 = handling_outliers(df_1 , display= True )

Boxplots after dropping outliers from the selected columns

This code calls handling_outliers again with:

  • df_1, the DataFrame produced by the previous call.

  • display=True, which plots the boxplots for each numerical column again.

Since drop is left at its default of False, no rows are removed; this call only re-plots the boxplots so the user can verify that the previous outlier removal worked.

Descriptive Statistics Analysis¶

In [30]:
# Calculate the minimum and maximum prices for each city in df_copy
df_1 = [df_copy.groupby("city")["price"].min(), df_copy.groupby("city")["price"].max()]
df_1 = pd.DataFrame(df_1).round()
df_1.index = ['Min Price', 'Max Price']
df_1 = df_1.T

# Calculate the average price for each city in 2014 and 2015
avg_2014 = (df_copy[df_copy['tr_year'] == 2014]).groupby('city')['price'].mean()
avg_2015 = (df_copy[df_copy['tr_year'] == 2015]).groupby('city')['price'].mean()

# Combine the average prices for 2014 and 2015 into a single DataFrame
avg = pd.DataFrame({'Avg_price_2014': avg_2014, 'Avg_price_2015': avg_2015}).round()

# Merge the minimum and maximum prices and average prices into a single DataFrame
df_1 = pd.merge(df_1, avg, right_index=True, left_index=True)

# Return the DataFrame with the minimum, maximum, and average prices for each city
df_1
Out[30]:
Min Price Max Price Avg_price_2014 Avg_price_2015
city
Auburn 90000 930000 290465.00000 293466.00000
Bellevue 247500 7062500 868641.00000 964244.00000
Black Diamond 135000 935000 423160.00000 424491.00000
Bothell 245500 1075000 484805.00000 505211.00000
Carnation 80000 1680000 457490.00000 450234.00000
Duvall 119500 1015000 425077.00000 424247.00000
Enumclaw 75000 858000 315381.00000 316340.00000
Fall City 142000 1862000 550451.00000 629035.00000
Federal Way 86500 1275000 288659.00000 290833.00000
Issaquah 130000 2700000 613217.00000 619490.00000
Kenmore 160000 1600000 454812.00000 476956.00000
Kent 85000 859000 295941.00000 305853.00000
Kirkland 90000 5110800 636751.00000 667882.00000
Maple Valley 110000 1350000 362377.00000 374816.00000
Medina 787500 6885000 2347732.00000 1628019.00000
Mercer Island 500000 5300000 1187996.00000 1208438.00000
North Bend 175000 1950000 424430.00000 484867.00000
Redmond 170000 2280000 656615.00000 664390.00000
Renton 95000 3000000 404593.00000 401266.00000
Sammamish 280000 3200000 727859.00000 744102.00000
Seattle 78000 7700000 532586.00000 540007.00000
Snoqualmie 170000 1998000 517542.00000 546641.00000
Vashon 160000 1379900 478821.00000 515311.00000
Woodinville 200000 1920000 613481.00000 625420.00000

In summary, this code calculates the minimum, maximum, and average prices for each city in a DataFrame df_copy that has a price column and a tr_year column. It first calculates the minimum and maximum prices for each city using the groupby() method and creates a DataFrame df_1 to store the results. It then calculates the average price for each city in 2014 and 2015 using the groupby() method and creates a DataFrame avg to store the results. Finally, it merges the minimum and maximum prices and average prices into a single DataFrame df_1 using the merge() method and returns the DataFrame.
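
The same per-city summary can also be built in one pass with groupby().agg() and pivot_table(). A sketch on toy data (the cities are from the dataset, but the prices below are illustrative, not the real figures):

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["Seattle", "Seattle", "Kenmore", "Kenmore"],
    "tr_year": [2014, 2015, 2014, 2015],
    "price":   [200000, 300000, 150000, 180000],
})

# Min and max price per city in a single call
summary = df.groupby("city")["price"].agg(["min", "max"])

# Yearly averages as columns via a pivot table
avg = df.pivot_table(index="city", columns="tr_year",
                     values="price", aggfunc="mean")
avg = avg.rename(columns={2014: "Avg_price_2014", 2015: "Avg_price_2015"})

# Join on the city index
summary = summary.join(avg)
print(summary)
```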

In [31]:
# Create a bar plot of the DataFrame df_1
df_1.plot(kind='bar', figsize=(20,10))

# Rotate the x-axis labels by 90 degrees
plt.xticks(rotation=90)
Out[31]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]),
 [Text(0, 0, 'Auburn'),
  Text(1, 0, 'Bellevue'),
  Text(2, 0, 'Black Diamond'),
  Text(3, 0, 'Bothell'),
  Text(4, 0, 'Carnation'),
  Text(5, 0, 'Duvall'),
  Text(6, 0, 'Enumclaw'),
  Text(7, 0, 'Fall City'),
  Text(8, 0, 'Federal Way'),
  Text(9, 0, 'Issaquah'),
  Text(10, 0, 'Kenmore'),
  Text(11, 0, 'Kent'),
  Text(12, 0, 'Kirkland'),
  Text(13, 0, 'Maple Valley'),
  Text(14, 0, 'Medina'),
  Text(15, 0, 'Mercer Island'),
  Text(16, 0, 'North Bend'),
  Text(17, 0, 'Redmond'),
  Text(18, 0, 'Renton'),
  Text(19, 0, 'Sammamish'),
  Text(20, 0, 'Seattle'),
  Text(21, 0, 'Snoqualmie'),
  Text(22, 0, 'Vashon'),
  Text(23, 0, 'Woodinville')])

The overall code creates a bar plot of df_1 with x-axis labels rotated 90 degrees. The resulting plot shows the minimum, maximum, and average prices for each city in a visual format.

In [32]:
# Calculate the percentage change in average house prices from 2014 to 2015 for each city
df_1['change_percentage'] = round(((df_1['Avg_price_2015'] - df_1['Avg_price_2014'])/df_1['Avg_price_2014'])*100 , 2)

# Reset the index of the DataFrame
df_1 = df_1.reset_index()

# Create a line plot showing the average house prices for each city in 2014 and 2015
# create fig size
plt.figure(figsize=(12,8))
# select per city and average price in 2014
plt.plot(df_1['city'], df_1['Avg_price_2014'], label='2014')
# select per city and average price in 2015
plt.plot(df_1['city'], df_1['Avg_price_2015'], label='2015')
# create rotation =90
plt.xticks(rotation=90)
plt.legend()
# X label Name 'City'
plt.xlabel('City')
# Y label Name 'Average House Price'
plt.ylabel('Average House Price')
# Title NAme 'Change in Average House Prices from 2014 to 2015'
plt.title('Change in Average House Prices from 2014 to 2015')
# display
plt.show()

The overall code calculates the percentage change in average house prices from 2014 to 2015 for each city, creates a line plot showing the average house prices for each city in 2014 and 2015, and adds labels and a title to the plot. The resulting plot shows the change in average house prices from 2014 to 2015 for each city in a visual format.
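
The change_percentage formula is the standard relative change, (new - old) / old * 100. For example, using North Bend's 2014 and 2015 averages from the table above:

```python
# North Bend's average prices from the city summary table
old, new = 424430.0, 484867.0

# Relative change, rounded to two decimals as in the notebook
change = round((new - old) / old * 100, 2)
print(change)  # 14.24
```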

In [33]:
# Calculate the percentage change in average house prices from 2014 to 2015 for each city
df_1['change_percentage'] = round(((df_1['Avg_price_2015'] - df_1['Avg_price_2014'])/df_1['Avg_price_2014'])*100 , 2)

# Create a new figure with a specified size
plt.figure(figsize=(24,40))

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=.5, wspace=0.1)

# Loop through each row of the DataFrame and create a horizontal bar plot for each city
for i in df_1.index:
    # Get the values for the horizontal bar plot and the percentage change
    value = [df_1['Avg_price_2014'][i], df_1['Avg_price_2015'][i]]
    p = df_1['change_percentage'][i]
    
    # Determine whether the percentage change is positive or negative and set the arrow direction and text accordingly
    if p > 0:
        a = 'Increased'
        t = '<-'
    else:
        a = 'Decreased'
        t = '->'
        
    # Determine the number of rows and columns for the subplots and create a new subplot
    x = math.ceil(df_1.shape[0]/2)
    plt.subplot(x, 2, i+1)
    
    # Create a horizontal bar plot with the values and colors for each year
    ax = plt.barh(['Avg_price_2014', 'Avg_price_2015'], value, color=['tab:gray', 'tab:blue'])
    ax = plt.gcf().gca()
    
    # Annotate the percentage change with an arrow and text
    ax.annotate('{} by {}%'.format(a, p), 
                xy=(0, 'Avg_price_2015'),
                textcoords='axes fraction', 
                xytext=(0.8, 0.788),
                arrowprops=dict(facecolor='orange', lw=6, arrowstyle=t),
                horizontalalignment='right')
    
    # Set the title of the subplot to the city name
    ax.set_title(df_1['city'][i])

This code calculates the percentage change in average house prices from 2014 to 2015 for each city in the Pandas DataFrame df_1, and then creates a horizontal bar plot for each city that visualizes the change in prices.

  1. The first line of code creates a new column in df_1 called 'change_percentage' that calculates the percentage change in average house prices from 2014 to 2015.

  2. The second line of code creates a new figure with a specified size using the figure() function from the Matplotlib library. The figsize parameter specifies the width and height of the figure in inches.

  3. The third line of code adjusts the spacing between subplots using the subplots_adjust() function from Matplotlib. The hspace and wspace parameters control the vertical and horizontal spacing between subplots, respectively.

  4. The fourth line of code initiates a loop that iterates over each row in the DataFrame df_1 using the index attribute of the DataFrame.

  5. The fifth line of code gets the values for the horizontal bar plot and the percentage change for the current city. The value variable is a list containing the average house prices for 2014 and 2015, and the p variable is the percentage change in prices for the current city.

  6. The sixth line of code determines whether the percentage change is positive or negative and sets the arrow direction and text accordingly. If the percentage change is positive, the arrow direction is set to point left (indicating an increase) and the text is set to 'Increased'. Otherwise, the arrow direction is set to point right (indicating a decrease) and the text is set to 'Decreased'.

  7. The seventh line of code determines the number of rows and columns for the subplots and creates a new subplot using the subplot() function from Matplotlib. The ceil() function from the math module is used to round up the number of rows to the nearest integer.

  8. The eighth line of code creates a horizontal bar plot using the barh() function from Matplotlib. The barh() function creates a horizontal bar plot where the first argument is a list of y-values and the second argument is a list of corresponding x-values. In this case, the y-values are the strings 'Avg_price_2014' and 'Avg_price_2015', and the x-values are the average house prices for 2014 and 2015. The color parameter specifies the colors of the bars, with 'tab:gray' representing the color for 2014 and 'tab:blue' representing the color for 2015. The resulting ax variable contains the axis object for the current subplot.

  9. The ninth line of code annotates the percentage change with an arrow and text using the annotate() function from Matplotlib. The annotate() function adds the annotation to the plot and takes several arguments. The xy parameter specifies the location of the arrow, which is set to (0, 'Avg_price_2015') to indicate that the arrow starts at the left side of the plot and points towards the 'Avg_price_2015' bar. The textcoords parameter specifies the coordinate system for the text, which is set to 'axes fraction' to indicate that the text position is relative to the axis. The xytext parameter specifies the location of the text, which is set to (0.8, 0.788) to position the text to the right of the arrow. The arrowprops parameter controls the appearance of the arrow, including its color, thickness, and style, and is set to an orange face color with a thickness of 6 and an arrow style determined by the t variable. Finally, the horizontalalignment parameter specifies the horizontal alignment of the text relative to the arrow.

  10. The tenth line of code sets the title of the subplot to the city name using the set_title() method of the axis object. The city name is obtained from the 'city' column of the DataFrame df_1.

Overall, this code calculates the percentage change in average house prices from 2014 to 2015 for each city in df_1, and then creates a horizontal bar plot for each city that visualizes the change in prices in a clear and concise manner. The annotations and arrow directions make it easy to quickly interpret the direction and magnitude of the price changes. The code also uses various functions and methods from the Matplotlib and math libraries to create, customize, and adjust the subplots and visualizations.

Categorise Price¶

In [34]:
# print describe of 'price'
df_copy['price'].describe().round()
Out[34]:
count     21611.00000
mean     540085.00000
std      367143.00000
min       75000.00000
25%      321725.00000
50%      450000.00000
75%      645000.00000
max     7700000.00000
Name: price, dtype: float64

The first part of the code accesses the 'price' column of the DataFrame df_copy using the square bracket notation and passes it as an argument to the describe() method.

The describe() method calculates and returns a Series of summary statistics for the 'price' column. These statistics include the count of non-null values, the mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

The round() function is used to round the summary statistics to the nearest integer. This is achieved by chaining the round() function to the end of the describe() method call using the dot notation.

In [35]:
# Create a deep copy of the DataFrame df_copy
df_3 = df_copy.copy(deep=True)

# Create a new column in the DataFrame df_3 that categorizes the prices of the houses
df_3['cat_price'] = pd.cut(x=df_copy['price'], bins=[0,230000,450000,900000,df_copy['price'].max()], 
                           labels=['Affordable', 'Mid-Priced', 'Expensive', 'Luxury'])

The overall code creates a new DataFrame df_3 that is a deep copy of the existing DataFrame df_copy and adds a new column to df_3 that categorizes the prices of the houses into four categories: "Affordable", "Mid-Priced", "Expensive", and "Luxury".
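As a minimal sketch of how pd.cut() assigns these labels, here is the same binning applied to a few hypothetical prices chosen to land one in each bin (bins are right-inclusive by default):

```python
import pandas as pd

# Hypothetical prices, one per bin
prices = pd.Series([150000, 300000, 600000, 2000000])

# Same bin edges as the notebook: (0, 230k], (230k, 450k], (450k, 900k], (900k, max]
cats = pd.cut(prices,
              bins=[0, 230000, 450000, 900000, prices.max()],
              labels=['Affordable', 'Mid-Priced', 'Expensive', 'Luxury'])

print(list(cats))   # ['Affordable', 'Mid-Priced', 'Expensive', 'Luxury']
```

Note that the right edge of each interval is included, so a house priced exactly at 450000 falls into "Mid-Priced".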

In [36]:
# Create a new figure with a specified size
plt.figure(figsize=(24,10))

# Create a countplot of the price categories in the DataFrame df_3 using Seaborn
sns.countplot(x='cat_price', data=df_3);

The overall code creates a countplot showing the frequency of each price category in the DataFrame df_3. The resulting plot provides a visual summary of the distribution of house prices in df_3.

In [37]:
# Define a list of the price categories
cat = ['Affordable', 'Mid-Priced', 'Expensive', 'Luxury']

# Loop through each price category and create a line plot of the average house prices over time
for i in cat:
    # Create a new figure with a specified size
    plt.figure(figsize=(24, 10))
    
    # Get the average house prices over time for the current price category
    x = (df_3[df_3['cat_price'] == i]).groupby('date')['price'].mean()
    
    # Add labels and annotations to the plot
    plt.xlabel("Date")
    plt.ylabel("Price")
    plt.axhline(y=x.mean(), color='r', linewidth=10, label='Average')
    plt.title(i)
    
    # Plot the average house prices over time
    plt.plot(x, color='gray', label='Price', linewidth=5)
    
    # Add grid lines and a legend to the plot
    plt.grid(color='black', linestyle='--', linewidth=0.2)
    plt.legend()

The overall code creates a set of line plots showing the average house prices over time for each price category in the DataFrame df_3. Each plot includes a horizontal line at the average house price for the corresponding price category and a legend indicating the average and actual house prices over time. The resulting plots provide a visual summary of how house prices have varied over time for different price categories.

In [38]:
# Get the total house prices over time for the years 2014 and 2015
x = (df_copy[df_copy['tr_year'] == 2014]).groupby('date')['price'].sum()
y = (df_copy[df_copy['tr_year'] == 2015]).groupby('date')['price'].sum()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Add labels and annotations to the plot
# Label the x-axis 'Date'
plt.xlabel("Date")
# Label the y-axis 'Price'
plt.ylabel("Price")
# Plot each year's totals with its own color and line width
plt.plot(x, color='green', label='T_Price_2014', linewidth=10)
plt.plot(y, color='yellow', label='T_Price_2015', linewidth=10)
# Draw grid lines with the given color, line style, and width
plt.grid(color='black', linestyle='--', linewidth=0.2)

plt.legend()
Out[38]:
<matplotlib.legend.Legend at 0x1be1242c310>

The overall code creates a line plot showing the total house prices over time for two different years in the DataFrame df_copy. The resulting plot provides a visual summary of how the total house prices have varied over time between the years 2014 and 2015.

Price Per Sqft¶

In [39]:
# Create a deep copy of the DataFrame df_copy
df_avg = df_copy.copy(deep=True)

# Calculate the average price per square foot for living area and lot area
df_avg['avg_living_15'] = (df_copy['price'] / df_copy['sqft_living15']).round(2)
df_avg['avg_lot_15'] = (df_copy['price'] / df_copy['sqft_lot15']).round(2)

# Calculate the overall average price per square foot
df_avg['price_sqfr_avg'] = ((df_avg['avg_living_15'] + df_avg['avg_lot_15'])/2).round(2)

# Get the average price per square foot over time
x = df_avg.groupby('date')['price_sqfr_avg'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Add labels and annotations to the plot
plt.xlabel("Date")
plt.ylabel("Average Price per Sqft")
plt.grid(color='green', linestyle='--', linewidth=0.2)
plt.plot(x, color='r', label='Price per Sqft', linewidth=10);

The overall code creates a line plot showing the average price per square foot over time in the DataFrame df_avg. The resulting plot provides a visual summary of how the average price per square foot has varied over time, which can be useful for understanding trends in the housing market.

In [40]:
# Display a linear model plot
sns.lmplot(x='sqft_lot15',y='price',data=df_avg)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x1be0e4233a0>

The fitted regression line shows only a weak relationship between sqft_lot15 and price.

Price Per Sqft Each City¶

In [41]:
# Group the DataFrame df_avg by city and get the mean price per square foot
city_sqft = df_avg.groupby('city')['price_sqfr_avg'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = city_sqft.index.to_list()
y = city_sqft.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add labels and annotations to the plot
plt.xticks(rotation=45)
plt.xlabel("City")
plt.ylabel("Average Price per Sqft")
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot showing the average price per square foot for each city in the DataFrame df_avg. The resulting plot provides a visual summary of how the average price per square foot varies between different cities, which can be useful for understanding regional differences in the housing market.

House Age¶

Houses Built¶

In [42]:
# Create a new figure with a specified size
plt.figure(figsize=(12, 10))

# Create a histogram using Seaborn
sns.histplot(df_copy['yr_built'], color='red')

# Add grid lines to the plot
plt.grid(color='black', linestyle='--', linewidth=0.5)

The overall code creates a histogram showing the distribution of the yr_built variable in the DataFrame df_copy. The resulting plot provides a visual summary of when the houses in the dataset were built and how many houses were built in each year.

In [43]:
# Plot a linear model plot
sns.lmplot(x='yr_built',y='price',data=df_copy)
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x1be12124940>

Most Modified Homes¶

In [44]:
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Create two histograms using Seaborn, one for houses renovated by living area and one for houses renovated by lot area
sns.histplot(data=df_copy, x='yr_built', color='b', label="Houses that renovated by living")
sns.histplot(data=df_copy, x='yr_built', color='r', label="Houses that renovated by lot")

# Add grid lines to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')

# Add a legend to the plot
plt.legend()
Out[44]:
<matplotlib.legend.Legend at 0x1be1255fd30>

The overall code draws two overlapping histograms of the yr_built variable in the DataFrame df_copy. Note that both histplot() calls use the same column and data, so the two distributions coincide exactly; the legend labels suggest an intended comparison between houses renovated by living area and houses renovated by lot area, but no such filtering is applied in the code as written.

House Age vs Average Price¶

In [45]:
# Create a deep copy of the DataFrame to avoid modifying the original data
df_age = df_copy.copy(deep=True)

# Categorize the yr_built variable into different age groups using the cut() function from Pandas
df_age['age'] = pd.cut(x=df_age['yr_built'], bins=[0, 1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009, df_age['yr_built'].max()], 
                        labels=['1939 Or Earlier', '1940s', '1950s', '1960s', '1970s', '1980s', '1990s', '2000s', '2010 Or Later'])

# Group the DataFrame by age and calculate the mean price for each age group
df_new = df_age.groupby('age')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title, labels, and annotations to the plot
plt.suptitle("House Age vs Average Price")
plt.xticks(rotation=45)
plt.xlabel("Age Group")
plt.ylabel("Average Price")
plt.grid(color='red', linestyle='--', linewidth=0.5, axis='y')

The overall code creates a bar plot showing the average price for each age group in the DataFrame df_age. The resulting plot provides a visual summary of how the average price varies for houses built in different time periods, which can be useful for understanding the relationship between house age and price in the housing market.
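The groupby-then-mean pattern used throughout this section can be sketched on a toy DataFrame (hypothetical age groups and prices):

```python
import pandas as pd

# Hypothetical age groups and sale prices
df = pd.DataFrame({'age': ['1940s', '1940s', '2000s'],
                   'price': [200000, 400000, 900000]})

# groupby('age')['price'].mean() returns one average per group,
# indexed by the group label
avg = df.groupby('age')['price'].mean()

print(avg.to_dict())   # {'1940s': 300000.0, '2000s': 900000.0}
```

The resulting Series index (the group labels) and values then become the x and y inputs of the bar plot.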

Yearly Monthly Transactions¶

In [46]:
# Calculate the number of rows in the DataFrame for each year
y = [df_copy[df_copy['tr_year'] == 2014].shape[0], df_copy[df_copy['tr_year'] == 2015].shape[0]]

# Specify labels for the pie chart
mylabels = ["2014", "2015"]

# Specify an offset for the first slice of the pie chart
myexplode = [0.09, 0]

# Create a new figure with a specified size
plt.figure(figsize=(12, 8))

# Create a pie chart using Matplotlib
plt.pie(y, labels=mylabels, explode=myexplode, autopct='%1.1f%%', shadow=True, colors=['c', 'y'])

# Display the plot
plt.show()

The overall code creates a pie chart showing the distribution of the tr_year variable in the DataFrame df_copy for the years 2014 and 2015. The resulting plot provides a visual summary of the relative frequency of the two years in the dataset, which can be useful for understanding the temporal distribution of the data.

In [47]:
# Group the DataFrame by cat_price and tr_month, and calculate the value counts for each group
months = df_3.groupby('cat_price')['tr_month'].value_counts().unstack()

# Extract the x and y values for the pie chart
x = months.sum().index.to_list()
y = months.sum().to_list()

# Create a new figure with a specified size
plt.figure(figsize=(12, 8))

# Create a pie chart using Matplotlib
plt.pie(y, labels=x, autopct='%1.1f%%', startangle=0)

# Display the plot
plt.show()

The overall code creates a pie chart showing the distribution of the tr_month variable in the DataFrame df_3, aggregated across the cat_price categories (months.sum() sums the counts over all price categories for each month). The resulting plot provides a visual summary of the relative frequency of each month, which can be useful for understanding the temporal distribution of the data.

Basement Impact¶

In [48]:
# Create two new DataFrames based on whether the house has a basement or not
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df_base = df_copy[df_copy['sqft_basement'] != 0].copy()
df_no_base = df_copy[df_copy['sqft_basement'] == 0].copy()

# Calculate the ratio of basement square footage to total living square footage for houses with basements and store it in a new column
df_base['base_sqft_living'] = (df_base['sqft_basement'] / df_base['sqft_living']).round(2)

# Calculate the percentage-based price for houses with basements; note that
# (price * ratio) / price simplifies to the ratio itself
df_base['total_Pbase_price'] = ((df_base['price'] * df_base['base_sqft_living'])/df_base['price']).round(2)

# Calculate the mean percentage-based price for houses with basements and convert it to a percentage rounded to two decimal places
x = round(df_base['total_Pbase_price'].mean()*100 , 2)

The overall code creates a new DataFrame called df_base that includes only the rows from df_copy where the sqft_basement column is not zero, calculates the ratio of sqft_basement to sqft_living for those houses, and computes a "percentage-based price" for each house. Because the price terms cancel, (price * ratio) / price equals the ratio itself, so total_Pbase_price is simply the basement's share of the living area. The mean of this share is then stored in the variable x. This analysis can be useful for understanding the relative value of houses with basements compared to those without basements.
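A one-row toy example (with hypothetical figures) shows that the price terms in this formula cancel, leaving exactly the basement-to-living ratio:

```python
import pandas as pd

# One hypothetical house: 400 sqft basement out of 2000 sqft living area
df = pd.DataFrame({'price': [500000.0],
                   'sqft_basement': [400.0],
                   'sqft_living': [2000.0]})

ratio = (df['sqft_basement'] / df['sqft_living']).round(2)
# The notebook's formula: (price * ratio) / price cancels to the ratio
pct_price = ((df['price'] * ratio) / df['price']).round(2)

print(float(ratio.iloc[0]), float(pct_price.iloc[0]))   # 0.2 0.2
```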

In [49]:
# Calculate the mean percentage-based price for houses with basements and convert it to a percentage rounded to two decimal places
x = round(df_base['total_Pbase_price'].mean()*100 , 2)

# Specify labels for the pie chart
mylabels = ["Basement Representation of Total price", ""]

# Specify an offset for the first slice of the pie chart
myexplode = [0.09, 0]

# Create a new figure with a specified size
plt.figure(figsize=(12, 8))

# Create a pie chart using Matplotlib
plt.pie([x, 100-x], labels=mylabels, explode=myexplode, autopct='%1.1f%%', shadow=True, colors=['c', 'y'])

# Display the plot
plt.show()

The overall code creates a pie chart showing the percentage of the total house price that is represented by the basement for houses with basements in the DataFrame df_base. The resulting plot provides a visual summary of the relative value of the basement compared to the rest of the house for houses with basements, which can be useful for understanding the importance of the basement in the overall value of the house.

House Features Vs Price¶

Grade¶

In [50]:
# Count occurrences of each grade
df_copy['grade'].value_counts()
Out[50]:
7     8980
8     6067
9     2615
6     2038
10    1134
11     399
5      242
12      90
4       29
13      13
3        3
1        1
Name: grade, dtype: int64
In [51]:
# Create a new DataFrame that includes only the rows where the grade is less than 11
df_grade = df_copy[df_copy['grade'] < 11]

# Group the resulting DataFrame by grade and calculate the mean price for each group
df_grade = df_grade.groupby('grade')['price'].mean()

The overall code creates a new DataFrame called df_grade that includes only the rows from df_copy where the grade column is less than 11, and calculates the mean price for each grade level in the resulting DataFrame. This analysis can be useful for understanding the relationship between the grade variable and the price variable once the rare top grades (11–13) are excluded.

In [52]:
# Create a new DataFrame that includes only the rows where the grade is less than 11
df_grade = df_copy[df_copy['grade'] < 11]

# Group the resulting DataFrame by grade and calculate the mean price for each group
df_grade = df_grade.groupby('grade')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_grade.index.to_list()
y = df_grade.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title to the plot
plt.suptitle("Grade vs Average Price")

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot that shows the relationship between the grade variable and the average price for houses with a grade less than 11 in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between the grade variable and the price variable once the rare top grades are excluded, and for identifying any patterns or trends in the data.

Bedrooms¶

In [53]:
# Count houses per number of bedrooms
df_copy['bedrooms'].value_counts()
Out[53]:
3     9823
4     6881
2     2760
5     1601
6      272
1      199
7       38
0       13
8       13
9        6
10       3
11       1
33       1
Name: bedrooms, dtype: int64

This code provides a quick and convenient way to obtain the count of the number of properties in df_copy with a certain number of bedrooms, which can be useful for understanding the distribution of properties across different numbers of bedrooms and for identifying any outliers or anomalies in the data.

In [54]:
# Select rows where bedrooms <= 8
df_new = df_copy[df_copy['bedrooms'] <=8 ]

This code creates a new DataFrame called df_new that includes only the rows from a DataFrame called df_copy where the value in the 'bedrooms' column is less than or equal to 8.

  1. The code accesses the 'bedrooms' column of the DataFrame df_copy using square bracket notation.

  2. The code then creates a Boolean mask by applying the comparison operator <= to the 'bedrooms' column and the integer value 8. This comparison operator returns a Boolean value of True or False for each row in the column, depending on whether the value in that row is less than or equal to 8.

  3. The Boolean mask is then used to select only the rows from df_copy where the value in the 'bedrooms' column is less than or equal to 8, creating a new DataFrame called df_new.

Overall, this code provides a way to filter a DataFrame to include only the rows that meet a certain condition, in this case, when the number of bedrooms is less than or equal to 8. This can be useful for cleaning and preparing data for analysis by removing any outliers or invalid data points.
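The Boolean-mask filtering described above can be sketched with a few hypothetical rows, including a 33-bedroom outlier like the one present in this dataset:

```python
import pandas as pd

# Hypothetical rows, with one implausible 33-bedroom outlier
df = pd.DataFrame({'bedrooms': [2, 33, 5, 9],
                   'price': [300000, 640000, 510000, 700000]})

mask = df['bedrooms'] <= 8   # elementwise comparison -> Boolean Series
df_new = df[mask]            # keeps only the rows where the mask is True

print(df_new['bedrooms'].tolist())   # [2, 5]
```

The mask is an ordinary Series of True/False values aligned with the DataFrame's index, so it can also be combined with other conditions using `&` and `|`.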

In [55]:
# Group the DataFrame by bedrooms and calculate the mean price for each group
df_new = df_new.groupby('bedrooms')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title to the plot
plt.suptitle("Bedrooms vs Average Price")

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot that shows the relationship between the bedrooms variable and the average price in the DataFrame df_new. The resulting plot can be useful for understanding the relationship between the bedrooms variable and the price variable, and for identifying any patterns or trends in the data.

Bathrooms¶

In [56]:
# Count houses per number of bathrooms
df_copy['bathrooms'].value_counts()
Out[56]:
2.50000    5379
1.00000    3851
1.75000    3048
2.25000    2047
2.00000    1930
1.50000    1446
2.75000    1185
3.00000     753
3.50000     731
3.25000     589
3.75000     155
4.00000     136
4.50000     100
4.25000      79
0.75000      72
4.75000      23
5.00000      21
5.25000      13
0.00000      10
5.50000      10
1.25000       9
6.00000       6
0.50000       4
5.75000       4
6.75000       2
8.00000       2
6.25000       2
6.50000       2
7.50000       1
7.75000       1
Name: bathrooms, dtype: int64

This code creates a frequency table of the 'bathrooms' column in a Pandas DataFrame called df_copy.

  1. The first part of the code accesses the 'bathrooms' column of the DataFrame df_copy using square bracket notation.

  2. The value_counts() method is called on the 'bathrooms' column to count the number of occurrences of each unique value in the column and returns a Pandas Series where the unique values are the index and the corresponding counts are the values.

  3. The resulting frequency table is a Pandas Series object that is printed to the console, showing the number of times each unique value in the 'bathrooms' column appears in the DataFrame.

Overall, this code provides a quick and convenient way to obtain the count of the number of properties in df_copy with a certain number of bathrooms, which can be useful for understanding the distribution of properties across different numbers of bathrooms and for identifying any outliers or anomalies in the data.
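A small sketch of value_counts() on hypothetical bathroom counts; the normalize=True variant returns fractions of the total instead of raw counts:

```python
import pandas as pd

# Hypothetical bathroom counts
s = pd.Series([2.5, 1.0, 2.5, 2.5, 1.0, 3.0])

counts = s.value_counts()                 # sorted by frequency, descending
shares = s.value_counts(normalize=True)   # same ordering, as fractions

print(counts.to_dict())   # {2.5: 3, 1.0: 2, 3.0: 1}
```

Because the result is sorted by count, `counts.index[0]` is always the most common value.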

In [57]:
# Filter the original DataFrame to include only rows with at least one bathroom
df_new = df_copy[df_copy['bathrooms'] >= 1]

# Filter the resulting DataFrame to exclude rows with more than 4 bathrooms
df_new = df_new[df_new['bathrooms'] < 5]

The overall code filters the original DataFrame to include only the rows with at least one bathroom, and then further filters the resulting DataFrame to exclude rows with more than 4 bathrooms. This can be useful for creating a new DataFrame that focuses on houses with a reasonable number of bathrooms, which could be relevant for some types of analyses.

In [58]:
# Group the DataFrame by bathrooms and calculate the mean price for each group
df_new = df_new.groupby('bathrooms')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title to the plot
plt.suptitle("bathrooms vs Average Price")

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot that shows the relationship between the bathrooms variable and the average price in the DataFrame df_new. The resulting plot can be useful for understanding the relationship between the bathrooms variable and the price variable, and for identifying any patterns or trends in the data.

Floors¶

In [59]:
# Group the DataFrame by floors and calculate the mean price for each group
df_new = df_copy.groupby('floors')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title to the plot
plt.suptitle("floors vs Average Price")

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot that shows the relationship between the floors variable and the average price in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between the floors variable and the price variable, and for identifying any patterns or trends in the data.

Condition Of House¶

In [60]:
# Group the DataFrame by condition and calculate the mean price for each group
df_new = df_copy.groupby('condition')['price'].mean()

# Create a new figure with a specified size
plt.figure(figsize=(24, 10))

# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()

# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)

# Add a title to the plot
plt.suptitle("condition vs Average Price")

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')

The overall code creates a bar plot that shows the relationship between the condition variable and the average price in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between the condition variable and the price variable, and for identifying any patterns or trends in the data.

Average Price Vs All Features¶

In [61]:
# Create a new figure with a specified size
fig = plt.figure(figsize=(24, 70))

# Extract the column names of the DataFrame, excluding 'price'
y = df_copy.columns.to_list()
y.remove('price')

# Adjust the spacing between the subplots (subplots_adjust returns None, so no assignment is needed)
plt.subplots_adjust(hspace=0.7)

# Loop over each variable in the DataFrame
for i in y:
    # Group the DataFrame by the current variable and calculate the mean price for each group
    x = df_copy.groupby(i)['price'].mean()
    x = pd.DataFrame(x)
    
    # Calculate the index of the current variable in the list of column names
    a = y.index(i)
    
    # Create a new subplot with the appropriate title, labels, and color
    ax1 = plt.subplot(15, 2, a+1)
    ax1.set_ylabel("Average Price")
    ax1.set_xlabel(i)
    ax1.set_title("Average Price Vs {}".format(i))
    if y.index(i) % 2 != 0:
        col = 'red'
    else:
        col = 'blue'
    
    # Plot the mean price for each group as a line plot
    ax1.plot(x.index, x.price, color=col, label='Price', linewidth=10)

The overall code creates a grid of subplots that show the relationship between each variable in the DataFrame df_copy (except for the price variable) and the average price. The resulting grid of subplots can be useful for understanding the relationship between each variable and the price variable, and for identifying any patterns or trends in the data.

Machine Learning¶

Feature Engineer¶

In [62]:
# Drop the 'date' column
df_copy.drop('date',axis=1,inplace=True)
# print data
df_copy
Out[62]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... long sqft_living15 sqft_lot15 city state county population population_density tr_year tr_month
0 221900 3 1.00000 1180 5650 1.00000 0 3 7 1180.00000 ... -122.25700 1340 5650 Seattle WA King County 24092 4966 2014 10
1 538000 3 2.25000 2570 7242 2.00000 0 3 7 2170.00000 ... -122.31900 1690 7639 Seattle WA King County 37081 6879 2014 12
2 180000 2 1.00000 770 10000 1.00000 0 3 6 770.00000 ... -122.23300 2720 8062 Kenmore WA King County 20419 3606 2015 2
3 604000 4 3.00000 1960 5000 1.00000 0 5 7 1050.00000 ... -122.39300 1360 5000 Seattle WA King County 14770 6425 2014 12
4 510000 3 2.00000 1680 8080 1.00000 0 3 8 1680.00000 ... -122.04500 1800 7503 Sammamish WA King County 25748 2411 2015 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 360000 3 2.50000 1530 1131 3.00000 0 3 8 1530.00000 ... -122.34600 1530 1509 Seattle WA King County 45911 9905 2014 5
21609 400000 4 2.50000 2310 5813 2.00000 0 3 8 2310.00000 ... -122.36200 1830 7200 Seattle WA King County 25922 5573 2015 2
21610 402101 2 0.75000 1020 1350 2.00000 0 3 7 1020.00000 ... -122.29900 1020 2007 Seattle WA King County 26881 7895 2014 6
21611 400000 3 2.50000 1600 2388 2.00000 0 3 8 1600.00000 ... -122.06900 1410 1287 Issaquah WA King County 26141 469 2015 1
21612 325000 2 0.75000 1020 1076 2.00000 0 3 7 1020.00000 ... -122.29900 1020 1357 Seattle WA King County 26881 7895 2014 10

21611 rows × 24 columns

This code drops the 'date' column from a Pandas DataFrame called df_copy and then prints the resulting DataFrame to the console.

  1. The drop() method is called on df_copy to remove the 'date' column from the DataFrame. The axis=1 parameter specifies that the column should be dropped, and the inplace=True parameter specifies that the operation should be performed in place on the original DataFrame rather than returning a new DataFrame.

  2. The resulting DataFrame with the 'date' column removed is then displayed by evaluating df_copy as the last expression in the cell.

Overall, this code provides a way to remove a column from a DataFrame, which can be useful for cleaning and preparing data for analysis by removing any irrelevant or redundant columns.

In [63]:
# Create a label encoder object
le = LabelEncoder()

# Apply label encoding to the 'city' column in the dataframe 'df_copy'
df_copy['city_M'] = le.fit_transform(df_copy['city'])

# The label encoded values will be stored in a new column called 'city_M'

# Output the encoded dataframe
df_copy
Out[63]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... sqft_living15 sqft_lot15 city state county population population_density tr_year tr_month city_M
0 221900 3 1.00000 1180 5650 1.00000 0 3 7 1180.00000 ... 1340 5650 Seattle WA King County 24092 4966 2014 10 20
1 538000 3 2.25000 2570 7242 2.00000 0 3 7 2170.00000 ... 1690 7639 Seattle WA King County 37081 6879 2014 12 20
2 180000 2 1.00000 770 10000 1.00000 0 3 6 770.00000 ... 2720 8062 Kenmore WA King County 20419 3606 2015 2 10
3 604000 4 3.00000 1960 5000 1.00000 0 5 7 1050.00000 ... 1360 5000 Seattle WA King County 14770 6425 2014 12 20
4 510000 3 2.00000 1680 8080 1.00000 0 3 8 1680.00000 ... 1800 7503 Sammamish WA King County 25748 2411 2015 2 19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 360000 3 2.50000 1530 1131 3.00000 0 3 8 1530.00000 ... 1530 1509 Seattle WA King County 45911 9905 2014 5 20
21609 400000 4 2.50000 2310 5813 2.00000 0 3 8 2310.00000 ... 1830 7200 Seattle WA King County 25922 5573 2015 2 20
21610 402101 2 0.75000 1020 1350 2.00000 0 3 7 1020.00000 ... 1020 2007 Seattle WA King County 26881 7895 2014 6 20
21611 400000 3 2.50000 1600 2388 2.00000 0 3 8 1600.00000 ... 1410 1287 Issaquah WA King County 26141 469 2015 1 9
21612 325000 2 0.75000 1020 1076 2.00000 0 3 7 1020.00000 ... 1020 1357 Seattle WA King County 26881 7895 2014 10 20

21611 rows × 25 columns

This code provides a way to encode categorical data into numerical values, which can be useful for machine learning algorithms that require numerical input. The label encoder assigns a unique integer to each category in the sorted (alphabetical) order of the unique values, not by frequency.

In [64]:
# Create a label encoder object
le = LabelEncoder()

# Apply label encoding to the 'state' column in the dataframe 'df_copy'
df_copy['state_M'] = le.fit_transform(df_copy['state'])

# The label encoded values will be stored in a new column called 'state_M'

# Output the encoded dataframe
df_copy
Out[64]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... sqft_lot15 city state county population population_density tr_year tr_month city_M state_M
0 221900 3 1.00000 1180 5650 1.00000 0 3 7 1180.00000 ... 5650 Seattle WA King County 24092 4966 2014 10 20 0
1 538000 3 2.25000 2570 7242 2.00000 0 3 7 2170.00000 ... 7639 Seattle WA King County 37081 6879 2014 12 20 0
2 180000 2 1.00000 770 10000 1.00000 0 3 6 770.00000 ... 8062 Kenmore WA King County 20419 3606 2015 2 10 0
3 604000 4 3.00000 1960 5000 1.00000 0 5 7 1050.00000 ... 5000 Seattle WA King County 14770 6425 2014 12 20 0
4 510000 3 2.00000 1680 8080 1.00000 0 3 8 1680.00000 ... 7503 Sammamish WA King County 25748 2411 2015 2 19 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 360000 3 2.50000 1530 1131 3.00000 0 3 8 1530.00000 ... 1509 Seattle WA King County 45911 9905 2014 5 20 0
21609 400000 4 2.50000 2310 5813 2.00000 0 3 8 2310.00000 ... 7200 Seattle WA King County 25922 5573 2015 2 20 0
21610 402101 2 0.75000 1020 1350 2.00000 0 3 7 1020.00000 ... 2007 Seattle WA King County 26881 7895 2014 6 20 0
21611 400000 3 2.50000 1600 2388 2.00000 0 3 8 1600.00000 ... 1287 Issaquah WA King County 26141 469 2015 1 9 0
21612 325000 2 0.75000 1020 1076 2.00000 0 3 7 1020.00000 ... 1357 Seattle WA King County 26881 7895 2014 10 20 0

21611 rows × 26 columns

This code performs label encoding on the values in the 'state' column of a Pandas DataFrame called df_copy.

  1. The first line of code creates a new LabelEncoder object called le. A label encoder is a preprocessing technique that assigns a unique integer value to each category in a categorical feature.

  2. The second line of code applies label encoding to the 'state' column in df_copy by calling the fit_transform() method of the le object on the 'state' column. The fit_transform() method fits the label encoder to the 'state' column and then transforms the category labels into numerical values.

  3. The resulting numerical values are stored in a new column called 'state_M' in df_copy. The column name 'state_M' stands for 'state' after being 'LabelEncoded'.

  4. The final line of code displays the encoded DataFrame by evaluating df_copy as the last expression in the cell.

Overall, this code provides a way to encode categorical data into numerical values, which can be useful for machine learning algorithms that require numerical data as input. The label encoder assigns a unique integer to each category in the sorted (alphabetical) order of the unique values. In this case, the 'state_M' column contains the label encoded values for the 'state' column.
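A minimal sketch of LabelEncoder's ordering, using a few hypothetical city names: the fitted classes_ attribute holds the unique values in sorted order, and each code is a value's position within it.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical city labels, chosen to make the sorted ordering visible
cities = ['Seattle', 'Kenmore', 'Seattle', 'Issaquah']

le = LabelEncoder()
codes = le.fit_transform(cities)

print(list(le.classes_))   # ['Issaquah', 'Kenmore', 'Seattle']
print(list(codes))         # [2, 1, 2, 0]
```

The mapping can be reversed with `le.inverse_transform(codes)` if the original labels are needed again.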

In [65]:
# Create a new feature called 'age' by subtracting the 'yr_built' column from the current year
df_copy['age'] = 2023 - df_copy['yr_built']

# Create a feature called 'total_sqft' that represents the total square footage of the property
df_copy['total_sqft'] = df_copy['sqft_living'] + df_copy['sqft_lot']

# Standardize the numerical features using the StandardScaler
scaler = StandardScaler()
# Choose the numerical columns to scale
num_cols = ['population','bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'age', 'total_sqft']
df_copy[num_cols] = scaler.fit_transform(df_copy[num_cols])
# Print the first few rows of the transformed DataFrame to inspect the data
df_copy.head()
Out[65]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... state county population population_density tr_year tr_month city_M state_M age total_sqft
0 221900 -0.39871 -1.44752 -0.97980 -0.22833 -0.91544 -0.30577 -0.62916 -0.55883 -0.73468 ... WA King County -0.59278 4966 2014 10 20 0 0.54501 -0.24904
1 538000 -0.39871 0.17556 0.53370 -0.18989 0.93644 -0.30577 -0.62916 -0.55883 0.46081 ... WA King County 0.57437 6879 2014 12 20 0 0.68120 -0.17734
2 180000 -1.47390 -1.44752 -1.42623 -0.12331 -0.91544 -0.30577 -0.62916 -1.40955 -1.22979 ... WA King County -0.92283 3606 2015 2 10 0 1.29403 -0.15431
3 604000 0.67648 1.14941 -0.13050 -0.24402 -0.91544 -0.30577 2.44426 -0.55883 -0.89167 ... WA King County -1.43043 6425 2014 12 20 0 0.20455 -0.24591
4 510000 -0.39871 -0.14905 -0.43538 -0.16966 -0.91544 -0.30577 -0.62916 0.29189 -0.13090 ... WA King County -0.44398 2411 2015 2 19 0 -0.54447 -0.17859

5 rows × 28 columns

This code performs several data preprocessing steps on a Pandas DataFrame called df_copy.

  1. The first line of code creates a new feature called 'age' by subtracting the 'yr_built' column from the current year (2023) and storing the result in a new 'age' column in the DataFrame. This calculates the age of each property in years.

  2. The second line of code creates a new feature called 'total_sqft' that represents the total square footage of the property by summing the 'sqft_living' and 'sqft_lot' columns and storing the result in a new 'total_sqft' column in the DataFrame.

  3. The third line of code creates a StandardScaler object called 'scaler' that will be used to standardize the numerical features in the DataFrame.

  4. The fourth line of code specifies a list of column names called 'num_cols' that contains the names of the numerical features in the DataFrame that should be standardized.

  5. The fifth line of code applies the fit_transform() method of the 'scaler' object to the columns specified in 'num_cols'. This method fits the scaler to the data and then transforms the data to have mean 0 and standard deviation 1.

  6. The resulting standardized numerical features are stored back in the 'num_cols' columns of the DataFrame.

  7. Finally, the last line of code prints the first few rows of the transformed DataFrame to the console using the head() method to inspect the data.

Overall, this code performs several common data preprocessing steps, including creating new features, standardizing numerical features, and printing the resulting DataFrame. These steps are important for preparing data for analysis by machine learning algorithms, as they can help to improve the accuracy and interpretability of the results.
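The standardization step can be checked on a single toy column: after fit_transform, the scaled values have mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small numeric column standing in for one of num_cols (values illustrative)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
z = scaler.fit_transform(x)

# After scaling, the column has mean 0 and (population) standard deviation 1
print(z.ravel())
print(z.mean(), z.std())
```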

In [66]:
# create a subplot and figure size
fig,ax=plt.subplots(figsize=(15,10))
# Create a heatmap of the correlation matrix (numeric columns only)
sns.heatmap(df_copy.corr(numeric_only=True), annot=True, cmap='RdYlGn', fmt='.2f')
Out[66]:
<Axes: >

This code provides a convenient way to visualize the correlation between variables in a DataFrame using a heatmap, which can be useful for identifying patterns and relationships in the data. The correlation coefficients range from -1 to 1, with values closer to -1 indicating a negative correlation (inverse relationship), values closer to 1 indicating a positive correlation (direct relationship), and values closer to 0 indicating no correlation.
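The coefficients shown in the heatmap are Pearson correlations, which pandas computes for each pair of numeric columns; a minimal sketch with toy sqft_living/price values:

```python
import pandas as pd

# Toy values illustrating a strong positive relationship
df = pd.DataFrame({
    'sqft_living': [1000, 1500, 2000, 2500],
    'price':       [200000, 280000, 410000, 500000],
})

# Pearson r, the same quantity df.corr() puts in each cell of the matrix
r = df['sqft_living'].corr(df['price'])
print(round(r, 3))
```

Here r is close to 1, which would appear as a dark green cell in the heatmap above.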

In [67]:
# Choose the features that will be used to build the model
model_ft=['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
         'lat','sqft_lot15','population']
# Print the selected features
print('We will use these Features to build the model : '+str(model_ft))
# Print the number of selected features
print('Number of features: '+str(len(model_ft)))
We will use these Features to build the model : ['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'view', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'sqft_lot15', 'population']
Number of features: 11

The model_ft variable is a list that contains the names of the features that will be used to build the model. These features were chosen based on their potential importance in predicting the target variable, which is not shown in this code.

The first print() statement outputs a message to the console that displays the list of features that will be used to build the model. The str() function is used to convert the list to a string before concatenating it with the rest of the message.

The second print() statement outputs a message to the console that displays the number of features that will be used to build the model, which is the length of the model_ft list.

Overall, this code provides a way to select a subset of features from a DataFrame for use in a machine learning model, which can help to improve the accuracy and interpretability of the model by reducing the number of irrelevant or redundant features.

RegressionModel¶

RegModelSelection¶

In [68]:
# Identify columns with NaN values in X
cols_with_nan = df_copy.columns[df_copy.isna().any()].tolist()

# Drop rows with NaN values from X and y
df_copy.dropna(subset=cols_with_nan, inplace=True)
# create a variable from dataset 'Feature'
X = df_copy[model_ft]
# create a variable from dataset 'Target'
y = df_copy['price']

This code performs data preprocessing tasks to prepare the dataset for machine learning modeling. First, it identifies the columns in the dataset that contain missing values using the isna() method and stores their names in the cols_with_nan list. It then drops all the rows in the dataset that contain missing values in any of the columns specified in cols_with_nan using the dropna() method, which modifies the original dataframe. The resulting dataset is then split into features and target variables using the model_ft list and the 'price' column, respectively. These variables are then used for machine learning modeling.
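A minimal sketch of the NaN-handling step on a toy frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in column 'a'
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Identify columns that contain at least one NaN
cols_with_nan = demo.columns[demo.isna().any()].tolist()
print(cols_with_nan)

# Drop rows with NaN in any of those columns, modifying demo in place
demo.dropna(subset=cols_with_nan, inplace=True)
print(len(demo))
```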

In [69]:
# Split the data into training and testing sets 
# split size Train ==80
# split size Test ==20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This code splits the dataset into an 80% training set and a 20% test set, with random_state=42 to make the split reproducible.
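A self-contained sketch of the split on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(50, 2)   # 50 rows of toy features
y_demo = np.arange(50)                   # toy target, one value per row

# 80/20 row split; random_state=42 makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
```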

LogisticRegressionModel¶

LogisticModelSelection¶

In [70]:
# create an instance of the Logistic Regression class
lor = LogisticRegression()

# fit the logistic regression model to the scaled training data
lor.fit(X_train, y_train)

# use the trained model to make predictions on the scaled testing data
lor_pred = lor.predict(X_test)

# calculate the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) scores for the predictions
lor_mse = mean_squared_error(y_test, lor_pred)
lor_rmse = mean_squared_error(y_test, lor_pred, squared=False)
lor_r2 = r2_score(y_test, lor_pred)

# print the results
print('Logistic Regression MSE: {:.2f}'.format(lor_mse))
print('Logistic Regression RMSE: {:.2f}'.format(lor_rmse))
print('Logistic Regression R2: {:.2f}'.format(lor_r2))
Logistic Regression MSE: 152306019.15
Logistic Regression RMSE: 12341.23
Logistic Regression R2: 1.00

This code trains and evaluates a logistic regression model using the scikit-learn library.

An instance of the LogisticRegression() class is created and assigned to the variable lor.

The fit() method is used to train the model on the training set, X_train and y_train.

The predict() method is then used to generate predictions, lor_pred, on the test set, X_test.

The mean_squared_error() and r2_score() functions are used to compute the mean squared error (MSE), root mean squared error (RMSE; obtained by passing squared=False), and R-squared (R2) scores for the predictions. These scores are printed to the console using the print() function. Note that LogisticRegression is a classifier, not a regressor: it treats every distinct price value as a separate class, so regression metrics computed on its output should be interpreted with caution.
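Because scikit-learn's LogisticRegression is a classifier, fitting it on a continuous target makes every distinct value a separate class. A minimal sketch with toy values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature and a continuous-looking target with three distinct values
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([100, 200, 200, 300])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The model treated each distinct y value as a class label
print(clf.classes_)
# Predictions can only be drawn from those exact class values
print(clf.predict(X))
```

This is why the apparently excellent regression metrics above are misleading: the model can only ever emit price values it saw during training.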

LogisticRegressionGraph¶

In [71]:
# create a scatter plot of the actual vs. predicted values for the linear regression model
plt.scatter(y_test, lor_pred)

# add labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Logistic Regression')

# display the plot
plt.show()

This code creates a scatter plot to visualize the performance of a logistic regression model.

The scatter() function from the matplotlib library is used to create a scatter plot of the actual values of the target variable, y_test, against the predicted values of the target variable, lor_pred.

The xlabel(), ylabel(), and title() functions are used to add appropriate axis labels and a title to the plot.

The resulting plot allows for a visual evaluation of the performance of the logistic regression model, and can be used to identify any patterns or trends in the model's predictions.

An analysis of the scatter plot can be used to identify any areas where the model may be over- or under-predicting the target variable, and can guide future improvements to the model.

LinearRegressionModel¶

LinearModelSelection¶

In [72]:
# create an instance of the LinearRegression class
lr = LinearRegression()

# fit the linear regression model to the scaled training data
lr.fit(X_train, y_train)

# use the trained model to make predictions on the scaled testing data
lr_pred = lr.predict(X_test)

# calculate the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) scores for the predictions
lr_mse = mean_squared_error(y_test, lr_pred)
lr_rmse = mean_squared_error(y_test, lr_pred, squared=False)
lr_r2 = r2_score(y_test, lr_pred)

# print the results
print('Linear Regression MSE: {:.2f}'.format(lr_mse))
print('Linear Regression RMSE: {:.2f}'.format(lr_rmse))
print('Linear Regression R2: {:.2f}'.format(lr_r2))
Linear Regression MSE: 52838556699.08
Linear Regression RMSE: 229866.39
Linear Regression R2: 0.65

This code trains a linear regression model on training data and evaluates its performance on test data by computing three metrics: mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2). The LinearRegression class is used to create the model instance, which is then fit to the training data using the fit() method. Predictions are made on the test data using the predict() method, and the metrics are computed using the appropriate functions. Finally, the metrics are printed to the console using the print() function.

LinearRegressionGraph¶

In [73]:
# create a scatter plot of the actual vs. predicted values for the linear regression model
plt.scatter(y_test, lr_pred)

# add labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Linear Regression')

# display the plot
plt.show()

This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the linear regression model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

GradientBoostingRegressor¶

GradientModelSelection¶

In [74]:
# Instantiate a Gradient Boosting Regressor model with 100 estimators and a random state of 42
gb = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model on the scaled training set
gb.fit(X_train, y_train)

# Use the trained model to predict on the scaled test set
gb_pred = gb.predict(X_test)

# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
gb_mse = mean_squared_error(y_test, gb_pred)
gb_rmse = mean_squared_error(y_test, gb_pred, squared=False)
gb_r2 = r2_score(y_test, gb_pred)

# Print the computed metrics
print('Gradient Boosting MSE: {:.2f}'.format(gb_mse))
print('Gradient Boosting RMSE: {:.2f}'.format(gb_rmse))
print('Gradient Boosting R2: {:.2f}'.format(gb_r2))
Gradient Boosting MSE: 25071928541.03
Gradient Boosting RMSE: 158341.18
Gradient Boosting R2: 0.83

This code trains a Gradient Boosting Regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the GradientBoostingRegressor class with 100 estimators and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.

GradientBoostingGraph¶

In [75]:
# Visualize the predicted vs. actual values using a scatter plot
plt.scatter(y_test, gb_pred)

# Add axis labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Gradient Boosting Regression')

# Display the plot
plt.show()

This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the Gradient Boosting Regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

NeuralNetworkRegressor¶

NeuralNetworkModelSelection¶

In [76]:
# Instantiate a neural network regressor model with two hidden layers of size 100 and 50, maximum iterations of 1000, and a random state of 42
nn = MLPRegressor(hidden_layer_sizes=(100,50), max_iter=1000, random_state=42)

# Train the model on the scaled training set
nn.fit(X_train, y_train)

# Use the trained model to predict on the scaled test set
nn_pred = nn.predict(X_test)

# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
nn_mse = mean_squared_error(y_test, nn_pred)
nn_rmse = mean_squared_error(y_test, nn_pred, squared=False)
nn_r2 = r2_score(y_test, nn_pred)

# Print the computed metrics
print('Neural Network MSE: {:.2f}'.format(nn_mse))
print('Neural Network RMSE: {:.2f}'.format(nn_rmse))
print('Neural Network R2: {:.2f}'.format(nn_r2))
Neural Network MSE: 33912810579.48
Neural Network RMSE: 184154.31
Neural Network R2: 0.77

This code trains a neural network regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the MLPRegressor class with two hidden layers of size 100 and 50, maximum iterations of 1000, and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.

NeuralNetworkGraph¶

In [77]:
# Visualize the predicted vs. actual values using a scatter plot
plt.scatter(y_test, nn_pred)

# Add axis labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Neural Network Regression')

# Display the plot
plt.show()

This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the neural network regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

RandomForestRegressor¶

RandomForestRegressorSelection¶

In [78]:
# Instantiate a random forest regressor model with 100 estimators and a random state of 42
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the scaled training set
rf.fit(X_train, y_train)

# Use the trained model to predict on the scaled test set
rf_pred = rf.predict(X_test)

# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = mean_squared_error(y_test, rf_pred, squared=False)
rf_r2 = r2_score(y_test, rf_pred)

# Print the computed metrics
print('Random Forest MSE: {:.2f}'.format(rf_mse))
print('Random Forest RMSE: {:.2f}'.format(rf_rmse))
print('Random Forest R2: {:.2f}'.format(rf_r2))
Random Forest MSE: 22318190987.17
Random Forest RMSE: 149392.74
Random Forest R2: 0.85

This code trains a random forest regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the RandomForestRegressor class with 100 estimators and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.

RandomForestRegressorGraph¶

In [79]:
# Visualize the predicted vs. actual values for the random forest regression model
plt.scatter(y_test, rf_pred)

# Set the x-axis label
plt.xlabel('Actual values')

# Set the y-axis label
plt.ylabel('Predicted values')

# Set the plot title
plt.title('Random Forest Regression')

# Show the plot
plt.show()

This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the random forest regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

ExtraTreesRegressor¶

ExtraTreesRegressorSelection¶

In [80]:
#This line initializes an instance of the ExtraTreesRegressor class with the specified hyperparameters
ETR = ExtraTreesRegressor(n_estimators=500 , n_jobs= -1 , max_depth=24 ,min_samples_split=8 , min_samples_leaf=9 )
# Fit the extra trees regression model to the scaled training set
ETR.fit(X_train, y_train)

# Predict the target values for the test set using the trained model
ETR_pred = ETR.predict(X_test)

# Compute and print the metrics for the model performance evaluation
ETR_mse = mean_squared_error(y_test, ETR_pred)
ETR_rmse = mean_squared_error(y_test, ETR_pred, squared=False)
ETR_r2 = r2_score(y_test, ETR_pred)

# Print the computed metrics
print('Extra Trees Regressor MSE: {:.2f}'.format(ETR_mse))
print('Extra Trees Regressor RMSE: {:.2f}'.format(ETR_rmse))
print('Extra Trees Regressor R2: {:.2f}'.format(ETR_r2))
Extra Trees Regressor MSE: 32606447477.30
Extra Trees Regressor RMSE: 180572.55
Extra Trees Regressor R2: 0.78

This code trains an Extra Trees Regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the ExtraTreesRegressor class with the specified hyperparameters. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance. The Extra Trees Regressor is a type of ensemble learning method that combines multiple decision trees to make more accurate predictions.

ExtraTreesRegressorGraph¶

In [81]:
# Visualize the predicted vs. actual values for the Extra Trees Regressor model
# Create a scatter plot with the actual values on the x-axis and the predicted values on the y-axis
plt.scatter(y_test, ETR_pred)

# Set the label for the x-axis
plt.xlabel('Actual values')

# Set the label for the y-axis
plt.ylabel('Predicted values')

# Set the title of the plot
plt.title('Extra Trees Regression')

# Display the plot
plt.show()

This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the Extra Trees Regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

RegressionCompare¶

In [82]:
# Compute accuracy score and visualize performance for each model
models = {'Logistic Regression':lor,'Linear Regression': lr, 'Gradient Boosting': gb, 'Neural Network': nn, 'Random Forest': rf, 'Extra Trees Regressor': ETR}
for name, model in models.items():
    score = model.score(X_test, y_test)
    print('{} Accuracy Score: {:.2f}'.format(name, score))
    
    pred = model.predict(X_test)
    plt.scatter(y_test, pred)
    plt.xlabel('Actual prices')
    plt.ylabel('Predicted prices')
    plt.title(name)
    plt.show()
Logistic Regression Accuracy Score: 0.01
Linear Regression Accuracy Score: 0.65
Gradient Boosting Accuracy Score: 0.83
Neural Network Accuracy Score: 0.77
Random Forest Accuracy Score: 0.85
Extra Trees Regressor Accuracy Score: 0.78

This code computes the score and visualizes the performance of each of the six models used to predict house prices. A dictionary is created containing the name of each model along with its corresponding instance.

For each model, the score() method is called on the test set. For the regressors this returns the R2 score; for LogisticRegression, which is a classifier, it returns classification accuracy instead, which explains its very low score of 0.01 here. Each score is printed to the console using the print() function.

Next, the predict() method is called on the trained model with the test set as an argument to generate predictions. A scatter plot is created using the scatter() function to visualize the relationship between the actual and predicted target values. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and a title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.

This process is repeated for each of the six models, allowing for a comparison of their scores and visual performance. The scatter plots provide a visual representation of how well the models predict prices, with a more tightly clustered distribution of points indicating better performance.

Difference between measures of accuracy and goodness of fit in regression analysis¶

Mean squared error (MSE):¶

MSE measures the average squared difference between the predicted values and the actual values in a regression model. It is calculated by taking the sum of the squared differences between the predicted and actual values and dividing by the number of observations.

MSE = (1/n) * ∑(y - ŷ)^2¶

where y is the actual value, ŷ is the predicted value, and n is the number of observations.

MSE is useful for comparing different models as it penalizes large errors more than small errors. However, MSE has the disadvantage of being difficult to interpret as it is expressed in squared units.

Root mean squared error (RMSE):¶

RMSE is the square root of MSE and is therefore expressed in the same units as the dependent variable. RMSE is a popular metric for evaluating the accuracy of predictive models. It is calculated by taking the square root of the MSE.

RMSE = sqrt(MSE)¶

RMSE is useful because it gives a meaningful interpretation of the magnitude of the prediction errors. However, like MSE, RMSE also does not take into account the variability of the data.

R-squared (R2):¶

R-squared is a measure of how well the regression model fits the data. It is the proportion of the variance in the dependent variable that is explained by the independent variable(s). R-squared ranges from 0 to 1, with 1 indicating a perfect fit and 0 indicating no fit at all.

R2 = 1 - (SSres/SStot)¶

where SSres is the residual sum of squares, ∑(y - ŷ)^2, and SStot is the total sum of squares, ∑(y - ȳ)^2, with ȳ the mean of the actual values.
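The three formulas can be computed directly with NumPy and checked against scikit-learn's helpers, using toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = np.mean((y_true - y_pred) ** 2)            # (1/n) * sum((y - yhat)^2)
rmse = np.sqrt(mse)                              # sqrt(MSE)
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # 1 - SSres/SStot

# The manual formulas agree with scikit-learn's implementations
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mse, rmse, r2)
```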

In [83]:
# Define the models
models = [lor,lr, gb, nn, rf, ETR]
model_names = ['LOG','LR', 'GB', 'NN', 'RF', 'ETR']

# Create empty lists to store the evaluation metrics
mse_scores = []
rmse_scores = []
r2_scores = []

# Evaluate each model
for model in models:
    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mse_scores.append(mse)
    
    # Root Mean Squared Error (RMSE)
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    rmse_scores.append(rmse)
    
    # R-Squared (R2) Score
    r2 = r2_score(y_test, model.predict(X_test))
    r2_scores.append(r2)
      
# Create a dataframe to store the evaluation metrics
evaluation_df = pd.DataFrame({'Model': model_names,
                              'MSE': mse_scores,
                              'RMSE': rmse_scores,
                              'R2': r2_scores
                              })

# Print the evaluation metrics for each model
print(evaluation_df)

# Create a bar plot to compare the MSE scores of the models
plt.bar(model_names, mse_scores)
plt.title('Mean Squared Error')
plt.xlabel('Model')
plt.ylabel('MSE')
plt.show()

# Create a bar plot to compare the RMSE scores of the models
plt.bar(model_names, rmse_scores)
plt.title('Root Mean Squared Error')
plt.xlabel('Model')
plt.ylabel('RMSE')
plt.show()

# Create a bar plot to compare the R2 scores of the models
plt.bar(model_names, r2_scores)
plt.title('R-Squared Score')
plt.xlabel('Model')
plt.ylabel('R2')
plt.show()


# Print the best model for each evaluation metric
best_mse_model = evaluation_df.loc[evaluation_df['MSE'].idxmin(), 'Model']
best_rmse_model = evaluation_df.loc[evaluation_df['RMSE'].idxmin(), 'Model']
best_r2_model = evaluation_df.loc[evaluation_df['R2'].idxmax(), 'Model']


print('Best Model (MSE):', best_mse_model)
print('Best Model (RMSE):',best_rmse_model)
print('Best Model (r2):',best_r2_model)
  Model               MSE         RMSE      R2
0   LOG   152306019.14851  12341.23248 0.99898
1    LR 52838556699.08271 229866.38880 0.64633
2    GB 25071928541.02862 158341.17765 0.83218
3    NN 33912810579.47690 184154.31187 0.77301
4    RF 22318190987.17298 149392.74074 0.85062
5   ETR 32606447477.29959 180572.55461 0.78175
Best Model (MSE): LOG
Best Model (RMSE): LOG
Best Model (r2): LOG

This code evaluates the performance of six different models on the task of predicting house prices. Three evaluation metrics, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-Squared (R2) score, are used to compare the performance of these models.

For each model, the code computes these three evaluation metrics and stores them in lists. These lists are then used to create a pandas DataFrame that summarizes the performance of each model.

The code also creates three bar plots to visualize the performance of each model for each evaluation metric. Finally, the code prints the best model for each evaluation metric based on the results obtained from the evaluation DataFrame.

Note that the 'best' model by all three metrics is the logistic regression, but this is an artifact of treating price as a class label rather than a continuous target; among the true regression models, the random forest performs best on every metric.

This allows for a comprehensive comparison of the performance of each model, enabling users to choose the best model for the prediction task at hand.

In [84]:
#Create scatter plot for each model

plt.scatter(y_test, lor_pred, label='Logistic Regression')
plt.scatter(y_test, lr_pred, label='Linear Regression')
plt.scatter(y_test, gb_pred, label='Gradient Boosting')
plt.scatter(y_test, nn_pred, label='Neural Network')
plt.scatter(y_test, rf_pred, label='Random Forest')
plt.scatter(y_test, ETR_pred, label='Extra Trees Regressor')

#Set plot labels and title
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Comparison of Regression Models based on Actual vs. Predicted values')

#Add legend to the plot
plt.legend()

#Display the plot
plt.show()

This code creates a scatter plot to compare the performance of six different regression models used to predict house prices. The scatter() function from the Matplotlib library is used to create the plot, with each model's actual and predicted values on the x- and y-axes, respectively.

The code then uses the label parameter to add a label to each model's scatter plot. The xlabel(), ylabel(), and title() functions are used to set appropriate axis labels and title for the plot.

Finally, the legend() function is used to add a legend to the plot that identifies each model's scatter plot. The legend helps to distinguish between the different models and their corresponding scatter plots. The resulting plot allows for a visual comparison of the performance of each model, providing insights into which models are better suited for the prediction task at hand.

ClassificationModel¶

In [85]:
# Take copy of data to classification
df_copy1=df_copy.copy(deep=True)
# print Dataset to check  copy
df_copy1
Out[85]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... state county population population_density tr_year tr_month city_M state_M age total_sqft
0 221900 -0.39871 -1.44752 -0.97980 -0.22833 -0.91544 -0.30577 -0.62916 -0.55883 -0.73468 ... WA King County -0.59278 4966 2014 10 20 0 0.54501 -0.24904
1 538000 -0.39871 0.17556 0.53370 -0.18989 0.93644 -0.30577 -0.62916 -0.55883 0.46081 ... WA King County 0.57437 6879 2014 12 20 0 0.68120 -0.17734
2 180000 -1.47390 -1.44752 -1.42623 -0.12331 -0.91544 -0.30577 -0.62916 -1.40955 -1.22979 ... WA King County -0.92283 3606 2015 2 10 0 1.29403 -0.15431
3 604000 0.67648 1.14941 -0.13050 -0.24402 -0.91544 -0.30577 2.44426 -0.55883 -0.89167 ... WA King County -1.43043 6425 2014 12 20 0 0.20455 -0.24591
4 510000 -0.39871 -0.14905 -0.43538 -0.16966 -0.91544 -0.30577 -0.62916 0.29189 -0.13090 ... WA King County -0.44398 2411 2015 2 19 0 -0.54447 -0.17859
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 360000 -0.39871 0.50018 -0.59871 -0.33743 2.78832 -0.30577 -0.62916 0.29189 -0.31203 ... WA King County 1.36782 9905 2014 5 20 0 -1.29349 -0.34928
21609 400000 0.67648 0.50018 0.25059 -0.22439 0.93644 -0.30577 -0.62916 0.29189 0.62987 ... WA King County -0.42834 5573 2015 2 20 0 -1.46372 -0.21795
21610 402101 -1.47390 -1.77214 -1.15402 -0.33214 0.93644 -0.30577 -0.62916 -0.55883 -0.92789 ... WA King County -0.34217 7895 2014 6 20 0 -1.29349 -0.35628
21611 400000 -0.39871 0.50018 -0.52249 -0.30708 0.93644 -0.30577 -0.62916 0.29189 -0.22750 ... WA King County -0.40867 469 2015 1 9 0 -1.12326 -0.31737
21612 325000 -1.47390 -1.77214 -1.15402 -0.33876 0.93644 -0.30577 -0.62916 -0.55883 -0.92789 ... WA King County -0.34217 7895 2014 10 20 0 -1.25945 -0.36287

21611 rows × 28 columns

This code provides a way to create a new DataFrame that is a copy of an existing DataFrame, which can be useful for making changes to the data without affecting the original DataFrame. In this case, the copy is created to prepare the data for classification tasks.

In [86]:
#Define the quantile boundaries
#q = [0, 0.25, 0.5, 0.75, 1]
q=[0,0.33,0.66,1]

#Define the bin labels
labels = ['SalaryA', 'SalaryB', 'SalaryC', ]

#Perform binning on the 'price' column, overwriting it with the bin labels
df_copy1['price'] = pd.qcut(df_copy1['price'], q=q, labels=labels)

#Display the updated dataframe
df_copy1
Out[86]:
price bedrooms bathrooms sqft_living sqft_lot floors view condition grade sqft_above ... state county population population_density tr_year tr_month city_M state_M age total_sqft
0 SalaryA -0.39871 -1.44752 -0.97980 -0.22833 -0.91544 -0.30577 -0.62916 -0.55883 -0.73468 ... WA King County -0.59278 4966 2014 10 20 0 0.54501 -0.24904
1 SalaryB -0.39871 0.17556 0.53370 -0.18989 0.93644 -0.30577 -0.62916 -0.55883 0.46081 ... WA King County 0.57437 6879 2014 12 20 0 0.68120 -0.17734
2 SalaryA -1.47390 -1.44752 -1.42623 -0.12331 -0.91544 -0.30577 -0.62916 -1.40955 -1.22979 ... WA King County -0.92283 3606 2015 2 10 0 1.29403 -0.15431
3 SalaryC 0.67648 1.14941 -0.13050 -0.24402 -0.91544 -0.30577 2.44426 -0.55883 -0.89167 ... WA King County -1.43043 6425 2014 12 20 0 0.20455 -0.24591
4 SalaryB -0.39871 -0.14905 -0.43538 -0.16966 -0.91544 -0.30577 -0.62916 0.29189 -0.13090 ... WA King County -0.44398 2411 2015 2 19 0 -0.54447 -0.17859
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21608 SalaryA -0.39871 0.50018 -0.59871 -0.33743 2.78832 -0.30577 -0.62916 0.29189 -0.31203 ... WA King County 1.36782 9905 2014 5 20 0 -1.29349 -0.34928
21609 SalaryB 0.67648 0.50018 0.25059 -0.22439 0.93644 -0.30577 -0.62916 0.29189 0.62987 ... WA King County -0.42834 5573 2015 2 20 0 -1.46372 -0.21795
21610 SalaryB -1.47390 -1.77214 -1.15402 -0.33214 0.93644 -0.30577 -0.62916 -0.55883 -0.92789 ... WA King County -0.34217 7895 2014 6 20 0 -1.29349 -0.35628
21611 SalaryB -0.39871 0.50018 -0.52249 -0.30708 0.93644 -0.30577 -0.62916 0.29189 -0.22750 ... WA King County -0.40867 469 2015 1 9 0 -1.12326 -0.31737
21612 SalaryA -1.47390 -1.77214 -1.15402 -0.33876 0.93644 -0.30577 -0.62916 -0.55883 -0.92789 ... WA King County -0.34217 7895 2014 10 20 0 -1.25945 -0.36287

21611 rows × 28 columns

This code performs quantile-based binning on the price column of the dataframe df_copy1. The q parameter specifies the quantile boundaries used for binning (here, tertiles), and the labels parameter specifies the labels for the resulting bins.

The pd.qcut() function performs the binning, dividing the data into intervals that each contain roughly the same number of observations, and the result replaces the original price column.

The updated dataframe now shows a bin label for each observation in the price column, ready for classification and further analysis.
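
As a minimal, self-contained illustration of the same pd.qcut() call (the toy prices below are made up for demonstration, not taken from the housing data):

```python
import pandas as pd

# Six hypothetical prices; with q=[0, 0.33, 0.66, 1] each tertile gets about a third of the rows
prices = pd.Series([100, 200, 300, 400, 500, 600])
bins = pd.qcut(prices, q=[0, 0.33, 0.66, 1], labels=['SalaryA', 'SalaryB', 'SalaryC'])
print(bins.tolist())  # ['SalaryA', 'SalaryA', 'SalaryB', 'SalaryB', 'SalaryC', 'SalaryC']
```

Because qcut splits on quantiles of the data rather than fixed value ranges, the resulting classes are roughly balanced, which is why the classification reports later show similar support for SalaryA, SalaryB, and SalaryC.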

In [87]:
# Choose the feature columns to feed into the model
model_ft=['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
         'lat','sqft_lot15','population']
# Print the chosen columns
print('We will use these Features to build the model : '+str(model_ft))
# Print the number of chosen columns
print('Number of features: '+str(len(model_ft)))
We will use these Features to build the model : ['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'view', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'sqft_lot15', 'population']
Number of features: 11

This code defines a list of features to be used in building a prediction model. The list contains the names of the features that are most relevant to predicting the target variable (i.e., house price).

The print() function is used to display the list of features and the number of features in the list. This information is important for understanding the model's input data and how it is being used to make predictions.

By selecting only the most relevant features, the model can reduce the dimensionality of the input data, leading to faster training times and potentially better model performance. Additionally, using fewer features can help to avoid overfitting and reduce the risk of the model making predictions based on noise or irrelevant data.
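
The notebook picks these features by hand. As a sketch of an automated alternative (run on synthetic data here, not this dataset), scikit-learn's SelectKBest keeps the k features with the strongest univariate relationship to the target:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 8 candidate features, only 3 informative
X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print(selector.get_support())  # boolean mask: True for the 3 selected features
```

Such a selector could be compared against the hand-picked list above to check whether any of the eleven chosen features contribute little to the target.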

In [88]:
# Identify columns with NaN values
cols_with_nan = df_copy1.columns[df_copy1.isna().any()].tolist()

# Drop rows that have NaN values in those columns
df_copy1.dropna(subset=cols_with_nan, inplace=True)
# Define the feature matrix X
X = df_copy1[model_ft]
# Define the target vector y
y = df_copy1['price']

This code identifies the columns of df_copy1 that contain NaN (Not a Number) values using the isna() and any() functions, and stores them in a list called cols_with_nan.

The code then drops any rows with missing values in those columns from df_copy1 using the dropna() function. This ensures that the model is trained on complete data.

Finally, X is assigned the feature columns listed in model_ft and y is assigned the price column, so the model can use the relevant features to predict the target variable.
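
The same NaN-handling pattern on a tiny, made-up frame:

```python
import numpy as np
import pandas as pd

# Column 'a' has one missing value; column 'b' is complete
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
cols_with_nan = df.columns[df.isna().any()].tolist()  # ['a']
df.dropna(subset=cols_with_nan, inplace=True)         # removes the middle row
print(cols_with_nan, df.shape)                        # ['a'] (2, 2)
```

Passing subset=cols_with_nan is equivalent here to a plain dropna(), but it makes explicit which columns triggered the row removals.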

In [89]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This code splits the data into training and testing sets using the train_test_split() function from the scikit-learn library. The X and y variables are split into separate training and testing sets, with 80% of the data used for training and 20% used for testing. The random_state parameter is set to 42 to ensure reproducibility of the results. This allows the model to be trained on a subset of the data and tested on the remaining data to evaluate its performance.
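
A self-contained sketch of the same split on toy data. For a classification target like the binned price, the optional stratify argument (an addition, not used in the notebook) keeps the class proportions equal in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)        # 10 samples, 2 features
y = np.array(['A'] * 5 + ['B'] * 5)     # balanced two-class target
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_tr), len(X_te), sorted(y_te.tolist()))  # 8 2 ['A', 'B']
```

With stratify=y the two-sample test set is guaranteed to contain one 'A' and one 'B', mirroring the 50/50 class balance of the full data.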

LogisticRegressionCLFModel¶

LogisticModelCLFSelection¶

In [90]:
# create an instance of the LogisticRegression class
lor = LogisticRegression()

# fit the logistic regression model to the scaled training data
lor.fit(X_train, y_train)

# use the trained model to make predictions on the scaled testing data
y_pred_LOG = lor.predict(X_test)
# compute the accuracy on the test data
accuracyLOG = accuracy_score(y_test, y_pred_LOG)
# Print the name of the model and its accuracy on the test data
print('Logistic Regression Accuracy: ', accuracyLOG*100)
Logistic Regression Accuracy:  71.96391394864678

This code trains a Logistic Regression Classifier on the training set, X_train and y_train, using the LogisticRegression() class from the scikit-learn library. The fit() method trains the model on the training set, and the predict() method generates predictions, y_pred_LOG, on the test set, X_test.

The accuracy_score() function then computes the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyLOG. Finally, the print() function displays the name of the model and its accuracy on the test data as a percentage.

This allows for the evaluation of the Logistic Regression Classifier model's performance on the prediction task at hand.
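
Beyond the hard labels used above, LogisticRegression also exposes per-class probabilities via predict_proba(), which can help inspect borderline cases. A toy sketch (made-up one-feature data, not the housing features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[1.5]])[0]  # probability of class 0 and class 1
print(proba)                           # the two probabilities sum to 1
```

A sample near the decision boundary, like x = 1.5 here, gets probabilities close to 0.5 for each class, flagging predictions the model is least sure about.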

LogisticModelCLFGraph¶

In [91]:
# create the confusion matrix
cm = confusion_matrix(y_test, y_pred_LOG)
# Plot a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Logistic Regression Confusion Matrix')
# display the plot
plt.show()
# Create the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_LOG)
# create a classification report
cr = classification_report(y_test, y_pred_LOG)
print(cm)
print(cr)
[[1139  265   14]
 [ 278  885  291]
 [  16  348 1087]]
              precision    recall  f1-score   support

     SalaryA       0.79      0.80      0.80      1418
     SalaryB       0.59      0.61      0.60      1454
     SalaryC       0.78      0.75      0.76      1451

    accuracy                           0.72      4323
   macro avg       0.72      0.72      0.72      4323
weighted avg       0.72      0.72      0.72      4323

This code creates a confusion matrix and classification report to evaluate the Logistic Regression Classifier on the test set. The confusion_matrix() function from the scikit-learn library computes the matrix from the true labels, y_test, and the predictions, y_pred_LOG, and the heatmap() function from the seaborn library visualizes it with annotated cell counts.

The classification_report() function then summarizes the precision, recall, and F1-score for each price category, giving a per-class view of the Logistic Regression model's performance beyond the single accuracy figure.

DecisionTreeClassifier¶

DecisionTreeClassifierSelection¶

In [92]:
# create the model
dtc = DecisionTreeClassifier()
# fit the decision tree model
dtc.fit(X_train, y_train)
# store the predictions in y_pred_DT
y_pred_DT = dtc.predict(X_test)
# compute the accuracy on the test data
accuracyDT = accuracy_score(y_test, y_pred_DT)
# Print the name of the model and its accuracy on the test data
print('Decision Tree Accuracy: ', accuracyDT*100)
Decision Tree Accuracy:  75.15614156835531

This code trains a Decision Tree Classifier model on the training set, X_train and y_train, using the DecisionTreeClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, and the predict() method is used to generate predictions, y_pred_DT, on the test set, X_test.

The accuracy_score() function is then used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyDT. Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format.

This allows for the evaluation of the Decision Tree Classifier model's performance on the prediction task at hand.

DecisionTreeClassifierGraph¶

In [93]:
# create the confusion matrix
cm = confusion_matrix(y_test, y_pred_DT)
# Plot a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Decision Tree Confusion Matrix')
# display the plot
plt.show()
# Create the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_DT)
# create a classification report
cr = classification_report(y_test, y_pred_DT)
print(cm)
print(cr)
[[1133  263   22]
 [ 265  929  260]
 [  24  240 1187]]
              precision    recall  f1-score   support

     SalaryA       0.80      0.80      0.80      1418
     SalaryB       0.65      0.64      0.64      1454
     SalaryC       0.81      0.82      0.81      1451

    accuracy                           0.75      4323
   macro avg       0.75      0.75      0.75      4323
weighted avg       0.75      0.75      0.75      4323

This code creates a confusion matrix to evaluate the performance of a Decision Tree Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.

The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
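
The report's per-class numbers come straight from the matrix. For example, SalaryB's precision and recall can be recomputed by hand from the Decision Tree confusion matrix printed above:

```python
import numpy as np

# The Decision Tree confusion matrix from the output above (rows = true, columns = predicted)
cm = np.array([[1133, 263,   22],
               [ 265, 929,  260],
               [  24, 240, 1187]])
recall_B = cm[1, 1] / cm[1, :].sum()     # of all true SalaryB rows, fraction predicted SalaryB
precision_B = cm[1, 1] / cm[:, 1].sum()  # of all SalaryB predictions, fraction actually SalaryB
print(round(precision_B, 2), round(recall_B, 2))  # 0.65 0.64
```

These match the SalaryB row of the classification report, confirming that the report is a summary of the same counts shown in the heatmap.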

RandomForestClassifier¶

RandomForestClassifierSelection¶

In [94]:
# create the random forest model
rfc = RandomForestClassifier()
# fit the model
rfc.fit(X_train, y_train)
# store the predictions in y_pred_RF
y_pred_RF = rfc.predict(X_test)
# store the accuracy score in a variable
accuracyRF = accuracy_score(y_test, y_pred_RF)
# Print the name of the model and its accuracy on the test data
print('Random Forest Accuracy: ', accuracyRF*100)
Random Forest Accuracy:  81.72565348137867

RandomForestClassifierGraph¶

In [95]:
# create the confusion matrix
cm = confusion_matrix(y_test, y_pred_RF)

# Plot the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Random Forest Confusion Matrix')
# display the plot
plt.show()
# Create the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_RF)
# create a classification report
cr = classification_report(y_test, y_pred_RF)
print(cm)
print(cr)
[[1192  218    8]
 [ 175 1087  192]
 [   5  192 1254]]
              precision    recall  f1-score   support

     SalaryA       0.87      0.84      0.85      1418
     SalaryB       0.73      0.75      0.74      1454
     SalaryC       0.86      0.86      0.86      1451

    accuracy                           0.82      4323
   macro avg       0.82      0.82      0.82      4323
weighted avg       0.82      0.82      0.82      4323

This code creates a confusion matrix to evaluate the performance of a Random Forest Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.

The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.

GradientBoostingClassifier¶

GradientBoostingClassifierSelection¶

In [96]:
# create a Gradient Boosting Classifier
gbc = GradientBoostingClassifier()
# fit the model
gbc.fit(X_train, y_train)
# store the predictions in y_pred_GB
y_pred_GB = gbc.predict(X_test)
# store the accuracy score in a variable
accuracyGB = accuracy_score(y_test, y_pred_GB)
# Print the name of the model and its accuracy on the test data
print('Gradient Boosting Accuracy: ', accuracyGB*100)
Gradient Boosting Accuracy:  81.74878556557947

This code creates a Gradient Boosting Classifier model using the GradientBoostingClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.

The predict() method is then used to generate predictions, y_pred_GB, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyGB.

Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the Gradient Boosting Classifier model's performance on the prediction task at hand.

GradientBoostingClassifierGraph¶

In [97]:
# create the confusion matrix
cm = confusion_matrix(y_test, y_pred_GB)

# Plot the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Gradient Boosting Confusion Matrix')
# display the plot
plt.show()
# Create the confusion matrix
cm = confusion_matrix(y_test, y_pred_GB)
# Create the classification report
cr = classification_report(y_test, y_pred_GB)
# print the confusion matrix
print(cm)
# print the classification report
print(cr)
[[1208  203    7]
 [ 172 1096  186]
 [   4  217 1230]]
              precision    recall  f1-score   support

     SalaryA       0.87      0.85      0.86      1418
     SalaryB       0.72      0.75      0.74      1454
     SalaryC       0.86      0.85      0.86      1451

    accuracy                           0.82      4323
   macro avg       0.82      0.82      0.82      4323
weighted avg       0.82      0.82      0.82      4323

This code creates a confusion matrix to evaluate the performance of a Gradient Boosting Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.

The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.

The resulting confusion matrix and classification report allow for the evaluation of the Gradient Boosting Classifier model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.

AdaBoostClassifier¶

AdaBoostClassifierSelection¶

In [98]:
# create the AdaBoost classifier
abc = AdaBoostClassifier()
# fit the model
abc.fit(X_train, y_train)
# store the predictions in y_pred_AB
y_pred_AB = abc.predict(X_test)
# store the accuracy score in accuracyAB
accuracyAB = accuracy_score(y_test, y_pred_AB)
# Print the name of the model and its accuracy on the test data
print('AdaBoost Accuracy: ', accuracyAB*100)
AdaBoost Accuracy:  78.0707841776544

This code creates an AdaBoost Classifier model using the AdaBoostClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.

The predict() method is then used to generate predictions, y_pred_AB, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyAB.

Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the AdaBoost Classifier model's performance on the prediction task at hand.

By creating and training an AdaBoost Classifier, the code is exploring the use of an ensemble learning technique that combines multiple "weak" models to create a stronger predictor. AdaBoost works by iteratively adjusting the weights of misclassified observations to focus on those that are most difficult to predict. This can lead to improved accuracy and performance compared to using a single model.
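
That reweighting idea can be sketched with simplified two-class AdaBoost arithmetic (this is an illustration, not scikit-learn's exact internal algorithm): one misclassified sample out of four ends up carrying half the total weight for the next round.

```python
import numpy as np

weights = np.full(4, 0.25)                           # start with uniform sample weights
misclassified = np.array([False, True, False, False])
error = weights[misclassified].sum()                 # weighted error = 0.25
alpha = 0.5 * np.log((1 - error) / error)            # the weak learner's vote weight
weights *= np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()                             # renormalize so weights sum to 1
print(weights)  # the misclassified sample's weight rises to 0.5
```

After one round the next weak learner sees the hard sample with as much total weight as the three easy ones combined, which is what lets the ensemble focus on difficult observations.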

AdaBoostClassifierGraph¶

In [99]:
# create confusion_matrix
cm = confusion_matrix(y_test, y_pred_AB)
    
# Plot confusion matrix using heatmap
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# create xlabel Name Predicted label
plt.xlabel('Predicted label')
# create ylabel True label
plt.ylabel('True label')
# create title AdaBoost Confusion Matrix
plt.title('AdaBoost Confusion Matrix')
# display confusion matrix
plt.show()
# Create confusion matrix 
cm = confusion_matrix(y_test, y_pred_AB)
# Create classification report
cr = classification_report(y_test, y_pred_AB)
# print confusion_matrix
print(cm)
# print classification_report
print(cr)
    
[[1128  279   11]
 [ 164 1087  203]
 [   8  283 1160]]
              precision    recall  f1-score   support

     SalaryA       0.87      0.80      0.83      1418
     SalaryB       0.66      0.75      0.70      1454
     SalaryC       0.84      0.80      0.82      1451

    accuracy                           0.78      4323
   macro avg       0.79      0.78      0.78      4323
weighted avg       0.79      0.78      0.78      4323

This code creates a confusion matrix to evaluate the performance of an AdaBoost Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.

The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.

The resulting confusion matrix and classification report allow for the evaluation of the AdaBoost Classifier model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.

By creating and evaluating the performance of an AdaBoost Classifier, the code is exploring the use of an ensemble learning technique that combines multiple weak models to create a stronger predictor. AdaBoost works by iteratively adjusting the weights of misclassified observations to focus on those that are most difficult to predict. This can lead to improved accuracy and performance compared to using a single model.

SupportVectorClassifier¶

SupportVectorClassifierSelection¶

In [100]:
# create the Support Vector Classifier
svc = SVC()
# fit the model
svc.fit(X_train, y_train)
# store the predictions in y_pred_SV
y_pred_SV = svc.predict(X_test)
# store the accuracy score in accuracySVC
accuracySVC = accuracy_score(y_test, y_pred_SV)
# Print the name of the model and its accuracy on the test data
print('Support Vector Accuracy: ', accuracySVC*100)
Support Vector Accuracy:  79.15799213509138

This code creates a Support Vector Machine (SVM) model using the SVC() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.

The predict() method is then used to generate predictions, y_pred_SV, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracySVC.

Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the SVM model's performance on the prediction task at hand.

SupportVectorClassifierGraph¶

In [101]:
# create the confusion matrix
cm = confusion_matrix(y_test, y_pred_SV)
# Plot the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Support Vector Confusion Matrix')
# display the plot
plt.show()
# Create the confusion matrix
cm = confusion_matrix(y_test, y_pred_SV)
# Create the classification report
cr = classification_report(y_test, y_pred_SV)
# print the confusion matrix
print(cm)
# print the classification report
print(cr)
[[1152  256   10]
 [ 170 1087  197]
 [   3  265 1183]]
              precision    recall  f1-score   support

     SalaryA       0.87      0.81      0.84      1418
     SalaryB       0.68      0.75      0.71      1454
     SalaryC       0.85      0.82      0.83      1451

    accuracy                           0.79      4323
   macro avg       0.80      0.79      0.79      4323
weighted avg       0.80      0.79      0.79      4323

This code creates a confusion matrix to evaluate the performance of a Support Vector Machine (SVM) model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.

The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.

The resulting confusion matrix and classification report allow for the evaluation of the SVM model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.

By creating and evaluating the performance of an SVM model, the code is exploring the use of a powerful and versatile algorithm that can be used for both classification and regression tasks. SVM works by finding the optimal hyperplane that separates the data points into their respective classes, with the goal of maximizing the margin between the hyperplane and the closest points. This can lead to improved accuracy and performance compared to other linear classification models.
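
A minimal linearly separable example of that idea: after fitting, the classifier retains only the closest point from each class as support vectors (the kernel='linear' and C=1.0 settings below are illustrative choices, not the notebook's defaults):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_)                    # only the boundary points [1, 1] and [3, 3]
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # [0 1]
```

The outer points [0, 0] and [4, 4] play no role in the decision boundary, which is why SVMs can stay compact even on larger datasets.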

ClassificationCompare¶

In [102]:
model_names = ['Log', 'DT', 'RF', 'GB', 'AB', 'SVC']
# Collect the test accuracy of each model trained above
accuracies = []
accuracies.append(accuracyLOG)
accuracies.append(accuracyDT)
accuracies.append(accuracyRF)
accuracies.append(accuracyGB)
accuracies.append(accuracyAB)
accuracies.append(accuracySVC)

# Create a dataframe to store the evaluation metrics
evaluation_df = pd.DataFrame({'Model': model_names,
                              'Accuracy': accuracies
                              })

# Print the evaluation metrics for each model
print(evaluation_df)

# Create a bar plot to compare the accuracy of the models
plt.bar(model_names, accuracies)
plt.title('Model Accuracy')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.show()

# Print the best model based on accuracy
best_model = evaluation_df.loc[evaluation_df['Accuracy'].idxmax(), 'Model']
print(f'Best model: {best_model}')
  Model  Accuracy
0   Log   0.71964
1    DT   0.75156
2    RF   0.81726
3    GB   0.81749
4    AB   0.78071
5   SVC   0.79158
Best model: GB

This code compares the classification models trained above: logistic regression, decision tree, random forest, gradient boosting, AdaBoost, and support vector machine (SVM).

The model_names list is created to store the names of each model for later use in the analysis.

The accuracy of each model, computed earlier with the accuracy_score() function from the scikit-learn library, is collected in the accuracies list.

A dataframe called evaluation_df is created to store the evaluation metrics for each model, including the model name and accuracy. This dataframe is printed to the console using the print() function.

A bar plot is created using the bar() function from the matplotlib library to compare the accuracy of the models. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.

The idxmax() function is used to find the index of the highest accuracy value in the evaluation_df dataframe, and the corresponding model name is printed to the console using the print() function. This allows us to identify the best-performing model based on accuracy.
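
The idxmax() lookup pattern in isolation, on a made-up three-row frame:

```python
import pandas as pd

scores = pd.DataFrame({'Model': ['Log', 'DT', 'RF'],
                       'Accuracy': [0.72, 0.75, 0.82]})
# idxmax() returns the row index of the highest accuracy; .loc then reads the name
best_model = scores.loc[scores['Accuracy'].idxmax(), 'Model']
print(best_model)  # RF
```

Note that idxmax() returns the first maximum when there are ties, so two models with identical accuracy would silently resolve to whichever appears first in the list.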

In [ ]:
 

DeploymentReg¶

In [103]:
filename='Random_Forest_Model_Regression.joblib'
joblib.dump(rf,filename)
Out[103]:
['Random_Forest_Model_Regression.joblib']
In [104]:
loaded_model=joblib.load(filename)
Y_Pred=loaded_model.predict([[3,1,1180,1,0,7,1180,0,47.5,5650,24092]])
Y_Pred
Out[104]:
array([4334920.])
In [105]:
import pickle
with open('Random_Forest_Model_Regression.pkl', 'wb') as f:
    pickle.dump(rf, f)
In [106]:
df_copy[['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
         'lat','sqft_lot15','population','price']]
Out[106]:
bedrooms bathrooms sqft_living floors view grade sqft_above sqft_basement lat sqft_lot15 population price
0 -0.39871 -1.44752 -0.97980 -0.91544 -0.30577 -0.55883 -0.73468 -0.65869 -0.35251 -0.26072 -0.59278 221900
1 -0.39871 0.17556 0.53370 0.93644 -0.30577 -0.55883 0.46081 0.24531 1.16158 -0.18788 0.57437 538000
2 -1.47390 -1.44752 -1.42623 -0.91544 -0.30577 -1.40955 -1.22979 -0.65869 1.28355 -0.17239 -0.92283 180000
3 0.67648 1.14941 -0.13050 -0.91544 -0.30577 -0.55883 -0.89167 1.39791 -0.28323 -0.28453 -1.43043 604000
4 -0.39871 -0.14905 -0.43538 -0.91544 -0.30577 0.29189 -0.13090 -0.65869 0.40959 -0.19286 -0.44398 510000
... ... ... ... ... ... ... ... ... ... ... ... ...
21608 -0.39871 0.50018 -0.59871 2.78832 -0.30577 0.29189 -0.31203 -0.65869 1.00498 -0.41238 1.36782 360000
21609 0.67648 0.50018 0.25059 0.93644 -0.30577 0.29189 0.62987 -0.65869 -0.35612 -0.20396 -0.42834 400000
21610 -1.47390 -1.77214 -1.15402 0.93644 -0.30577 -0.55883 -0.92789 -0.65869 0.24793 -0.39414 -0.34217 402101
21611 -0.39871 0.50018 -0.52249 0.93644 -0.30577 0.29189 -0.22750 -0.65869 -0.18436 -0.42051 -0.40867 400000
21612 -1.47390 -1.77214 -1.15402 0.93644 -0.30577 -0.55883 -0.92789 -0.65869 0.24576 -0.41795 -0.34217 325000

21611 rows × 12 columns

DeploymentClas¶

In [107]:
filenameC='Gradient_Boosting_Model_Classification.joblib'
joblib.dump(gbc,filenameC)
Out[107]:
['Gradient_Boosting_Model_Classification.joblib']
In [108]:
loaded_model=joblib.load(filenameC)
Y_Pred=loaded_model.predict([[3,1,1180,1,0,7,1180,0,47.5,5650,24092]])
Y_Pred
Out[108]:
array(['SalaryC'], dtype=object)
In [109]:
import pickle
with open('Random_Forest_Model_Classification.pkl', 'wb') as f:
    pickle.dump(rfc, f)
In [110]:
df_copy1[['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
         'lat','sqft_lot15','population','price']]
Out[110]:
bedrooms bathrooms sqft_living floors view grade sqft_above sqft_basement lat sqft_lot15 population price
0 -0.39871 -1.44752 -0.97980 -0.91544 -0.30577 -0.55883 -0.73468 -0.65869 -0.35251 -0.26072 -0.59278 SalaryA
1 -0.39871 0.17556 0.53370 0.93644 -0.30577 -0.55883 0.46081 0.24531 1.16158 -0.18788 0.57437 SalaryB
2 -1.47390 -1.44752 -1.42623 -0.91544 -0.30577 -1.40955 -1.22979 -0.65869 1.28355 -0.17239 -0.92283 SalaryA
3 0.67648 1.14941 -0.13050 -0.91544 -0.30577 -0.55883 -0.89167 1.39791 -0.28323 -0.28453 -1.43043 SalaryC
4 -0.39871 -0.14905 -0.43538 -0.91544 -0.30577 0.29189 -0.13090 -0.65869 0.40959 -0.19286 -0.44398 SalaryB
... ... ... ... ... ... ... ... ... ... ... ... ...
21608 -0.39871 0.50018 -0.59871 2.78832 -0.30577 0.29189 -0.31203 -0.65869 1.00498 -0.41238 1.36782 SalaryA
21609 0.67648 0.50018 0.25059 0.93644 -0.30577 0.29189 0.62987 -0.65869 -0.35612 -0.20396 -0.42834 SalaryB
21610 -1.47390 -1.77214 -1.15402 0.93644 -0.30577 -0.55883 -0.92789 -0.65869 0.24793 -0.39414 -0.34217 SalaryB
21611 -0.39871 0.50018 -0.52249 0.93644 -0.30577 0.29189 -0.22750 -0.65869 -0.18436 -0.42051 -0.40867 SalaryB
21612 -1.47390 -1.77214 -1.15402 0.93644 -0.30577 -0.55883 -0.92789 -0.65869 0.24576 -0.41795 -0.34217 SalaryA

21611 rows × 12 columns

Conclusion¶

The provided code shows a sample of a housing dataset containing various features such as price, number of bedrooms and bathrooms, living area, lot size, floors, waterfront, view, grade, year built, and others. The analysis of this dataset includes data exploration, manipulation, visualization, and machine learning techniques to predict the housing prices and classify the properties.

The machine learning section includes regression and classification models, which aim to predict housing prices or classify properties based on their features. The exploratory data analysis includes analyzing and visualizing the relationships between the features and the target variable, identifying outliers, and understanding the distribution of the data.

The insights gained from this analysis could be useful to stakeholders such as home buyers, real estate agents, and property developers. The developed models provide a way to predict housing prices or classify properties from their features, supporting informed decisions about buying or selling. Overall, this analysis contributes to the field of real estate by identifying the factors that influence housing prices and building accurate prediction models.

Future Work¶

Although the analysis of the housing dataset has provided valuable insights and produced accurate prediction models, there are still areas for further research and improvement. Potential future work includes:

  1. Incorporating additional features: The dataset used in this analysis includes various features, but there may be other features that could influence housing prices, such as crime rates, proximity to schools or public transportation, and nearby amenities. Incorporating these features could improve the accuracy of the prediction models.

  2. Improving the models' performance: Although the developed models have high accuracy, there is still room for improvement. Techniques such as ensemble learning, feature selection, and hyperparameter tuning could be used to further enhance the models' performance.

  3. Evaluating the models' generalizability: The developed models were trained and tested on a specific dataset, so it is important to evaluate their generalizability to other datasets or real-world scenarios. Cross-validation techniques and testing on new datasets could be used to assess the models' generalizability.

  4. Exploring interpretability: While the developed models have high accuracy, they may lack interpretability, meaning it may not be clear which features are driving the predictions. Exploring interpretability techniques such as feature importance and partial dependence plots could provide insights into the models' decision-making processes.
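Points 2–4 above can be sketched together with scikit-learn. The snippet below is a minimal illustration, not the notebook's actual pipeline: it uses a small synthetic regression problem in place of the housing data, a modest grid for hyperparameter tuning, `cross_val_score` to gauge generalizability, and `permutation_importance` as one interpretability technique.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the housing features (the real dataset is not loaded here).
X, y = make_regression(n_samples=300, n_features=6, n_informative=5,
                       noise=10.0, random_state=42)

# (2) Hyperparameter tuning with a small grid search.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
    scoring="r2",
)
grid.fit(X, y)
best = grid.best_estimator_

# (3) Cross-validation to estimate how the tuned model generalizes.
cv_scores = cross_val_score(best, X, y, cv=5, scoring="r2")
print("Mean CV R^2:", cv_scores.mean())

# (4) Permutation importance: which features drive the predictions.
imp = permutation_importance(best, X, y, n_repeats=5, random_state=42)
ranking = np.argsort(imp.importances_mean)[::-1]
print("Features ranked by importance:", ranking)
```

On the real dataset, `X` and `y` would be the engineered feature matrix and the price target, and the grid would typically be wider; partial dependence plots (`sklearn.inspection.PartialDependenceDisplay`) are a natural next step after the importance ranking.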

Overall, pursuing these directions could further improve the accuracy and applicability of the developed models, providing valuable insights for real estate stakeholders and contributing to real estate research.